---
title: "Predicting Divorce"
author: "Celeste Chen and Sarah Unbehaun"
date: "May 8, 2017"
output:
pdf_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(gridExtra)
library(plyr)
library(rpart)
library(rpart.plot)
library(Hmisc)
library(DMwR)
library(randomForest)
library(plotROC)
library(factoextra)
library(NbClust)
library(cluster)
```
```{r data, include=FALSE}
load("GSSdivorce.rda")
```
## Project Overview
Given the evolving discourse and perspectives on marriage, we seek to predict the probability that a couple will divorce based on factors such as social views; attitudes about pornography, premarital sex, or extramarital sex; and fundamentalism of religious views. Using data from the General Social Survey, we will train an algorithm on a generational cohort from the years 1996-2014 to determine which types of respondents are most likely to experience divorce. We will test four methods of analyzing the data for predictions: (1) logistic regression, (2) decision trees, (3) random forests, and (4) clustering. Based on the accuracy results of each method, we will determine the best model for prediction.
## Data
Our dataset is taken from the General Social Survey (GSS) and focuses on observations from 1996 to 2014. According to the [GSS website](http://gss.norc.org/About-The-GSS):
>The GSS contains a standard core of demographic, behavioral, and attitudinal questions, plus topics of special interest. Among the topics covered are civil liberties, crime and violence, intergroup tolerance, morality, national spending priorities, psychological well-being, social mobility, and stress and traumatic events. Altogether the GSS is the single best source for sociological and attitudinal trend data covering the United States.
We have extracted variables on the following characteristics for our analysis: opinions on sex education in the public schools; region of interview; opinions on premarital sex; opinions on extramarital sex; views on pornography laws; marital status; political party ID; how fundamentalist the respondent is; number of children; age of respondent; ever been widowed; highest educational degree achieved; respondent's labor force status; whether the respondent thinks of self as liberal or conservative; general self-reported happiness; whether people can be trusted; self-reported class; self-reported income; region at age 16; who the respondent lived with at age 16; religion at age 16; frequency of church attendance; whether the respondent believes other people would try to take advantage or be fair; whether the respondent believes other people usually try to be helpful; size of city or town of residence; confidence in the scientific community; self-reported job satisfaction; self-reported satisfaction with financial situation; whether abortions should be legal for married women who don't want more children; whether abortions should be legal if the woman is single and doesn't want to marry; and whether the respondent has seen an X-rated movie in the last year.
Not all of these questions were asked of all respondents. Three ballots were administered, so our data had to be analyzed in groups corresponding to each ballot. We also restricted our data to those who were married or "split," split meaning either divorced or separated. The number of observations in each category is shown below:
Ballot | A | B | C | Total
--------|------|------|------|-------
Married | 4008 | 4025 | 4054 | 12853
Split | 1607 | 1665 | 1667 | 5238
Total | 5615 | 5700 | 5721 | 18091
```{r missing, include=FALSE, cache=TRUE}
#Restrict to 1996-2014 and married, divorced or separated
GSS.divorce <- GSS.divorce[GSS.divorce$year>=1996, ]
GSS.divorce <- GSS.divorce[GSS.divorce$marital=="MARRIED" | GSS.divorce$marital=="DIVORCED" | GSS.divorce$marital=="SEPARATED", ]
GSS.divorce$marital <- factor(GSS.divorce$marital)
#Collapse marital status to married or split
GSS.divorce$marital2 <- "married"
GSS.divorce$marital2 [GSS.divorce$marital == "SEPARATED"] <- "split"
GSS.divorce$marital2 [GSS.divorce$marital == "DIVORCED"] <- "split"
GSS.divorce$marital2 [is.na(GSS.divorce$marital) ==T] <- NA
GSS.divorce$marital2 <- factor(GSS.divorce$marital2)
GSSnew <- GSS.divorce[!is.na(GSS.divorce$marital2),]
summary(GSSnew$marital2)
table(GSSnew$marital2, GSSnew$ballot)
#Create categorical variable for size of place
GSSnew$sizecat <- NA
GSSnew$sizecat [GSSnew$size < 10] <- "rural"
GSSnew$sizecat [GSSnew$size >=10 & GSSnew$size <= 99] <- "small"
GSSnew$sizecat [GSSnew$size >=100 & GSSnew$size <= 999] <- "medium"
GSSnew$sizecat [GSSnew$size >=1000 & GSSnew$size <= 9999] <- "large"
GSSnew$sizecat <- factor(GSSnew$sizecat)
str(GSSnew)
```
```{r visuals test,include=FALSE}
##Opinion on Premarital Sex
#Married vs Divorced vs Separated vs Other
tab <- as.data.frame(table(GSSnew$marital2, GSSnew$premarsx))
colnames(tab) <- c("status", "premarital", "count")
pie_1b <- ggplot(tab, aes(status, premarital)) +
geom_point(aes(size = count), colour = "navy") +
theme_bw() + xlab("Marital Status?") + ylab("Opinion on premarital sex?") + scale_size_continuous(range=c(3,15)) + theme(plot.title = element_text(size = 10), legend.position="none") + coord_fixed(ratio = 1)
##How happy are you?
tab <- as.data.frame(table(GSSnew$marital2, GSSnew$happy))
colnames(tab) <- c("status", "happy", "count")
pie_5b <- ggplot(tab, aes(status, happy)) +
geom_point(aes(size = count), colour = "navy") +
theme_bw() + xlab("Marital Status?") + ylab("How happy are you?") + scale_size_continuous(range=c(3,15)) + theme(plot.title = element_text(size = 10), legend.position="none") + coord_fixed(ratio = 1)
#Sex Education in Schools
SexEducGraph2 <- ggplot(data=subset(GSSnew, !is.na(sexeduc)), aes(x= sexeduc, group=marital2)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
geom_text(aes( label = scales::percent(..prop..),
y= ..prop.. ), stat= "count", vjust = -.5) +
labs(y = "Percent", fill="sexeduc") +
facet_grid(~marital2) +
scale_y_continuous(labels = scales::percent)
##Can people be trusted?
TrustGraph2 <- ggplot(data=subset(GSSnew, !is.na(trust)), aes(x= trust, group=marital2)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
geom_text(aes( label = scales::percent(..prop..),
y= ..prop.. ), stat= "count", vjust = -.5) +
labs(y = "Percent", fill="trust") +
facet_grid(~marital2) + scale_y_continuous(labels = scales::percent) + theme(axis.text.x=element_text(angle = -90, hjust = 0))
##Live with both parents at age 16
tab <- as.data.frame(table(GSSnew$marital2, GSSnew$family16))
colnames(tab) <- c("status", "family16", "count")
pie_21b <- ggplot(tab, aes(status, family16)) +
geom_point(aes(size = count), colour = "navy") +
theme_bw() + xlab("Marital Status?") + ylab("Live with both parents at age 16?") + scale_size_continuous(range=c(3,15)) + theme(plot.title = element_text(size = 10), legend.position="none") + coord_fixed(ratio = 1)
## rescale to both parents, parent/stepparent, other???
familyGraph2 <- ggplot(GSSnew, aes(x= family16, group=marital2)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
geom_text(aes( label = scales::percent(..prop..),
y= ..prop.. ), stat= "count", vjust = -.5) +
labs(y = "Percent", fill="family16") +
facet_grid(~marital2) +
scale_y_continuous(labels = scales::percent) + theme(axis.text.x=element_text(angle = -90, hjust = 0))
```
A few examples of the differences between the married and split groups can be seen in the graphs below. First, respondents who are split appear slightly more likely to say that premarital sex is not wrong at all. They are also significantly less likely to say that they are very happy.
``` {r visuals, echo=FALSE, fig.height=4}
grid.arrange(pie_1b, pie_5b, ncol=2)
```
Also, respondents who are split are more likely to say that others cannot be trusted and slightly more likely to support sexual education in schools.
``` {r visuals 2, echo=FALSE, fig.height=4}
TrustGraph2
SexEducGraph2
```
``` {r splits, include=FALSE, cache=TRUE }
## Separate GSSnew by ballot type A, B or C
GSS.A <- subset(GSSnew, (GSSnew$version==1|GSSnew$version==4))
GSS.A <- subset(GSS.A, select = c(marital2, year, sexeduc, region, premarsx, xmarsex, partyid, fund, childs, degree, Agecat1, widowed, wrkstat, polviews, happy, class, income, reg16, family16, born, parborn, sizecat, attend, relig16, bible, satfin, abnomore, absingle, divlaw, fefam))
GSS.A.comp <- GSS.A[complete.cases(GSS.A),]
GSS.B <- subset(GSSnew, (GSSnew$version==2|GSSnew$version==5))
GSS.B <- subset(GSS.B, select = c(marital2, year, sexeduc, premarsx, region, pornlaw, partyid, fund, childs, degree, Agecat1, widowed, wrkstat, polviews, happy, trust, class, income, reg16, family16, born, parborn, sizecat, attend, relig16, bible, divlaw, xmovie, satfin, helpful, fair, satjob, consci, fefam))
GSS.B.comp <- GSS.B[complete.cases(GSS.B),]
GSS.C <- subset(GSSnew, (GSSnew$version==3|GSSnew$version==6))
GSS.C <- subset(GSS.C, select = c(year, region, xmarsex, pornlaw, partyid, fund, childs, degree, Agecat1, widowed, wrkstat, polviews, happy, trust, class, reg16, family16, born, parborn, income, fair, helpful, sizecat, attend, relig16, consci, satjob, satfin, abnomore, absingle, xmovie, marital2))
GSS.C.comp <- GSS.C[complete.cases(GSS.C),]
## Train-validate-test split for GSS Ballot A
set.seed(567)
rand <- runif(nrow(GSS.A.comp))
trainA <- GSS.A.comp[rand>0.3,]
validA <- GSS.A.comp[rand>0.15 & rand <0.3,]
testA <- GSS.A.comp[rand<0.15,]
## Train-validate-test split for GSS Ballot B
set.seed(678)
rand <- runif(nrow(GSS.B.comp))
trainB <- GSS.B.comp[rand>0.3,]
validB <- GSS.B.comp[rand>0.15 & rand <0.3,]
testB <- GSS.B.comp[rand<0.15,]
## Train-validate-test split for GSS Ballot C
set.seed(789)
rand <- runif(nrow(GSS.C.comp))
trainC <- GSS.C.comp[rand>0.3,]
validC <- GSS.C.comp[rand>0.15 & rand <0.3,]
testC <- GSS.C.comp[rand<0.15,]
```
## Analysis Method 1: Logistic Regression
We first tried logistic regression models, using all relevant variables for each ballot group as well as models using subsets of variables: demographics only, social/political views only, and only the variables that were statistically significant in the complete regression. Each ballot group was separated into train, validate, and test sets. The overall accuracy of each model on the test data, calculated as the number of correct predictions divided by the total number of observations, is reported below. Accuracy was fairly similar across models.
Ballot | Model | Overall Accuracy
-------|---------------|-----------------
A | all variables | 0.75
A | demographics | 0.76
B | all variables | 0.74
B | stat. signif. | 0.74
C | all variables | 0.72
C | pol/social | 0.71
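The accuracy figures above are computed by hand in the hidden chunks from 2x2 confusion tables. A small helper (not part of the original analysis) makes the calculation explicit; the counts below are taken from the Ballot A demographics-only test-set comments in the chunk that follows:

```r
# Hypothetical helper: overall and per-class accuracy from a confusion table
# whose rows are actual statuses and whose columns are predicted statuses.
accuracy_summary <- function(conf) {
  list(overall   = sum(diag(conf)) / sum(conf),   # correct predictions / total
       per_class = diag(conf) / rowSums(conf))    # accuracy within each actual class
}

# Counts from the Ballot A demographics-only model on the test set
conf <- matrix(c(345, 31,
                  96, 48),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual    = c("married", "split"),
                               predicted = c("married", "split")))
accuracy_summary(conf)$overall  # roughly 0.76, matching the table above
```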
``` {r logit, include=FALSE, cache=TRUE}
###Logistic Regression
#glm(marital2 ~ <x>, data = <data>, family = binomial())
fitA <- glm(marital2 ~ year + sexeduc + region + premarsx + xmarsex + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + satfin + abnomore + absingle + divlaw, family=binomial(link='logit'),data=trainA)
summary(fitA)
trainA$pred <- fitA$fitted.values
trainA$pred <- ifelse(trainA$pred > 0.5,1,0)
table(trainA$marital2, trainA$pred)
# accuracy overall (1547 + 331)/(1547+124+345+331) = .80
validA$pred <- predict(fitA, newdata = validA, type = "response")
validA$pred <- ifelse(validA$pred > 0.5,1,0)
table(validA$marital2, validA$pred)
# accuracy overall (304 + 69)/(304+42+90+69) = .74
# Test A with just demographic variables
fitA2 <- glm(marital2 ~ year + region + childs + degree + Agecat1 + widowed + wrkstat + class + income + family16 + born + parborn + sizecat, family=binomial(link='logit'),data=trainA)
trainA$pred2 <- fitA2$fitted.values
trainA$pred2 <- ifelse(trainA$pred2 > 0.5,1,0)
table(trainA$marital2, trainA$pred2)
# accuracy overall (1561 + 266)/(1561+110+410+266) = .78
validA$pred2 <- predict(fitA2, newdata = validA, type = "response")
validA$pred2 <- ifelse(validA$pred2 > 0.5,1,0)
table(validA$marital2, validA$pred2)
# accuracy (320+56)/(320+26+1-3+56) = .94
# significant increase is most likely a fluke, run on test data:
testA$pred2 <- predict(fitA2, newdata = testA, type = "response")
testA$pred2 <- ifelse(testA$pred2 > 0.5,1,0)
table(testA$marital2, testA$pred2)
# accuracy (345+48)/(345+31+96+48) = 0.76
# compare with running fitA on testA
testA$pred <- predict(fitA, newdata = testA, type = "response")
testA$pred <- ifelse(testA$pred > 0.5,1,0)
table(testA$marital2, testA$pred)
# accuracy = 0.75
## Ballot B
fitB <- glm(marital2 ~ year + sexeduc + region + premarsx + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + trust + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + helpful + fair + consci + satjob + satfin + xmovie + divlaw + fefam, family=binomial(link='logit'),data=trainB)
summary(fitB)
trainB$pred <- fitB$fitted.values
trainB$pred <- ifelse(trainB$pred > 0.5,1,0)
table(trainB$marital2, trainB$pred)
# accuracy overall (1293 + 266)/(1293+114+338+243) = .78
validB$pred <- predict(fitB, newdata = validB, type = "response")
validB$pred <- ifelse(validB$pred > 0.5,1,0)
table(validB$marital2, validB$pred)
# accuracy overall (254 + 60)/(254+31+77+60) = .74
testB$pred <- predict(fitB, newdata = testB, type = "response")
testB$pred <- ifelse(testB$pred > 0.5,1,0)
table(testB$marital2, testB$pred)
# (264+51)/(264+39+73+51) = 0.74
# Test B with only stat sig variables
fitB2 <- glm(marital2 ~ pornlaw + childs + degree + Agecat1 + wrkstat + happy + class + income + reg16 + attend + bible + satjob + xmovie + divlaw, family=binomial(link='logit'),data=trainB)
summary(fitB2)
trainB$pred2 <- fitB2$fitted.values
trainB$pred2 <- ifelse(trainB$pred2 > 0.5,1,0)
table(trainB$marital2, trainB$pred2)
# accuracy overall (1304+234)/(1304+103+347+234) = .77
# not any better predictor for split, only .60
validB$pred2 <- predict(fitB2, newdata = validB, type = "response")
validB$pred2 <- ifelse(validB$pred2 > 0.5,1,0)
table(validB$marital2, validB$pred2)
# accuracy overall (255+56)/(255+30+81+56) = .74
testB$pred2 <- predict(fitB2, newdata = testB, type = "response")
testB$pred2 <- ifelse(testB$pred2 > 0.5,1,0)
table(testB$marital2, testB$pred2)
# (271+47)/(271+32+77+47) = 0.74
## Ballot C
fitC <- glm(marital2 ~ year + region + xmarsex + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + trust + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + helpful + fair + consci + satjob + satfin + xmovie, family=binomial(link='logit'),data=trainC)
summary(fitC)
trainC$pred <- fitC$fitted.values
trainC$pred <- ifelse(trainC$pred > 0.5,1,0)
table(trainC$marital2, trainC$pred)
# accuracy overall (1292+109)/(1292+109+295+288) = .71
validC$pred <- predict(fitC, newdata = validC, type = "response")
validC$pred <- ifelse(validC$pred > 0.5,1,0)
table(validC$marital2, validC$pred)
# accuracy overall (259+46)/(259+42+80+46) = .71
testC$pred <- predict(fitC, newdata = testC, type = "response")
testC$pred <- ifelse(testC$pred > 0.5,1,0)
table(testC$marital2, testC$pred)
fitC2 <- glm(marital2 ~ xmarsex + pornlaw + partyid + fund + polviews + happy + trust + attend + relig16 + helpful + fair + consci + satjob + satfin + xmovie, family=binomial(link='logit'),data=trainC)
summary(fitC2)
trainC$pred2 <- fitC2$fitted.values
trainC$pred2 <- ifelse(trainC$pred2 > 0.5,1,0)
table(trainC$marital2, trainC$pred2)
# accuracy overall (1296+183)/(1296+105+400+183) = .75
validC$pred2 <- predict(fitC2, newdata = validC, type = "response")
validC$pred2 <- ifelse(validC$pred2 > 0.5,1,0)
table(validC$marital2, validC$pred2)
# accuracy overall (262+33)/(262+39+93+33) = .69
testC$pred2 <- predict(fitC2, newdata = testC, type = "response")
testC$pred2 <- ifelse(testC$pred2 > 0.5,1,0)
table(testC$marital2, testC$pred2)
# accuracy (288+28)/(288+35+94+28) = 0.71
```
## Analysis Method 2: Decision Trees
For both decision trees and random forests (below), we tested models trained on complete cases and models trained on data with imputed values for missing observations. In both the complete-case and imputed settings, the decision tree models for all three ballots were more accurate than their random forest counterparts. For these initial runs, we tested our models on their corresponding training sets.
### Model: Decision Trees with Complete Cases: Initial Tests on Complete Cases Training Sets
We trained decision tree models on the complete observations for Ballots A, B, and C, then tested them on the corresponding test sets and compared the accuracy rates to those of the imputed models. As an example, the model based on the complete observations for Ballot A had an overall accuracy rate of 75.38%, with a 78.31% accuracy rate in predicting married statuses versus a 59.76% accuracy rate in predicting split statuses.
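The `cp` values passed to `rpart()` in the chunks below (e.g. 0.00479512) come from the model's complexity-parameter table. As a sketch of the standard workflow, shown here on rpart's built-in `kyphosis` data rather than the GSS data, one grows a tree and prunes at the `cp` with the lowest cross-validated error:

```r
library(rpart)

# Grow a classification tree on the built-in kyphosis dataset (illustration only)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")
printcp(fit)  # cp table: columns CP, nsplit, rel error, xerror, xstd

# Prune at the cp value minimizing the cross-validated error (xerror)
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```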
``` {r decision trees complete, include=FALSE, cache=TRUE}
#Ballot A
rtree_fitA <-rpart(marital2 ~ year + sexeduc + region + premarsx + xmarsex + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + satfin + abnomore + absingle + divlaw + fefam, data = trainA, method = 'class', cp = 0.00479512)
printcp(rtree_fitA)
rpart.plot(rtree_fitA, shadow.col="gray", nn=TRUE)
trainApred <- predict(rtree_fitA, trainA, type = 'class')
table(trainApred, trainA$marital2)
##Accuracy: (1570+319)/(1570+101+319+357)
###Accuracy: 80.49%
#Ballot B
rtree_fitB <-rpart(marital2 ~ year + sexeduc + region + premarsx + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + trust + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + helpful + fair + consci + satjob + satfin + divlaw + xmovie + fefam, data = trainB, method = 'class', cp = 0.00519031)
printcp(rtree_fitB)
rpart.plot(rtree_fitB, shadow.col="gray", nn=TRUE)
trainBpred <- predict(rtree_fitB, trainB, type = 'class')
table(trainBpred, trainB$marital2)
##Accuracy: (1356+164)/(1356+164+417+51)
###Accuracy: 76.46%
#Ballot C
rtree_fitC <-rpart(marital2 ~ year + region + xmarsex + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + trust + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + helpful + fair + consci + satjob + satfin + abnomore + absingle + xmovie, data = trainC, method = 'class', cp = 0.00604491)
printcp(rtree_fitC)
rpart.plot(rtree_fitC, shadow.col="gray", nn=TRUE)
trainCpred <- predict(rtree_fitC, trainC, type = 'class')
table(trainCpred, trainC$marital2)
##Accuracy: (1305+287)/(1305+287+296+96)
##Accuracy: 80.24%
##Source: https://www.r-bloggers.com/using-decision-trees-to-predict-infant-birth-weights/
####Decision trees: more error in predicting married (2/3), more accurate with predicting split (4/5)
```
```{r knn imputation, include=FALSE, cache=TRUE}
#using DMwR package
## Ballot A
knnOutputA <- knnImputation(GSS.A[, !names(GSS.A) %in% "medv"]) # kNN imputation; the "medv" exclusion is a leftover from a housing-data example and is a no-op here (GSS.A has no such column)
anyNA(knnOutputA)
summary(knnOutputA)
## Ballot B
knnOutputB <- knnImputation(GSS.B[, !names(GSS.B) %in% "medv"]) # perform knn imputation.
anyNA(knnOutputB)
summary(knnOutputB)
## Ballot C
knnOutputC <- knnImputation(GSS.C[, !names(GSS.C) %in% "medv"]) # perform knn imputation.
anyNA(knnOutputC)
summary(knnOutputC)
##Separate into train, validate and test
## Ballot A
set.seed(567)
rand <- runif(nrow(knnOutputA))
trainAknn <- knnOutputA[rand>0.3,]
validAknn <- knnOutputA[rand>0.15 & rand <0.3,]
testAknn <- knnOutputA[rand<0.15,]
## Ballot B
set.seed(678)
rand <- runif(nrow(knnOutputB))
trainBknn <- knnOutputB[rand>0.3,]
validBknn <- knnOutputB[rand>0.15 & rand <0.3,]
testBknn <- knnOutputB[rand<0.15,]
## Ballot C
set.seed(789)
rand <- runif(nrow(knnOutputC))
trainCknn <- knnOutputC[rand>0.3,]
validCknn <- knnOutputC[rand>0.15 & rand <0.3,]
testCknn <- knnOutputC[rand<0.15,]
```
### Model: Decision Trees with Imputed Values: Initial Tests on Imputed Data Training Sets
We trained decision tree models on imputed data for Ballots A, B, and C and then tested them. As with the random forest models, the decision tree model based on Ballot C's imputed observations was most accurate, with an overall accuracy rate of 78.73%. The tree below shows that the splits were based on income, happiness, degree, age, region at age 16, work status, financial satisfaction, family life at age 16, church attendance, and political views.
``` {r imputed decision trees, include=FALSE, cache=TRUE}
###Decision Trees
## Ballot A
rtree_fitAknn <-rpart(marital2 ~ year + sexeduc + region + premarsx + xmarsex + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + satfin + abnomore + absingle + divlaw + fefam, data = trainAknn, method = 'class', cp = 0.00435920)
printcp(rtree_fitAknn)
rpart.plot(rtree_fitAknn, shadow.col="gray", nn=TRUE)
trainApredknn <- predict(rtree_fitAknn, trainAknn, type = 'class')
table(trainApredknn, trainAknn$marital2)
##Accuracy: (2592+379)/(2592+379+768+170) = 76.00%
#Ballot B
rtree_fitBknn <-rpart(marital2 ~ year + sexeduc + region + premarsx + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + trust + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + helpful + fair + consci + satfin + divlaw + xmovie + fefam, data = trainBknn, method = 'class', cp = 0.00519031)
printcp(rtree_fitBknn)
rpart.plot(rtree_fitBknn, shadow.col="gray", nn=TRUE)
trainBpredknn <- predict(rtree_fitBknn, trainBknn, type = 'class')
table(trainBpredknn, trainBknn$marital2)
##Accuracy: (2699+311)/(2699+311+845+122) = 75.69%
#Ballot C
rtree_fitCknn <-rpart(marital2 ~ year + region + xmarsex + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + trust + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + helpful + fair + consci + satjob + satfin + abnomore + absingle + xmovie, data = trainCknn, method = 'class', cp = 0.00388601)
printcp(rtree_fitCknn)
rpart.plot(rtree_fitCknn, shadow.col="gray", nn=TRUE)
trainCpredknn <- predict(rtree_fitCknn, trainCknn, type = 'class')
table(trainCpredknn, trainCknn$marital2)
##Accuracy: (2702+448)/(2702+448+710+141) = 78.73%
```
```{r Imputed Ballot C, echo=FALSE, warning=FALSE, message=FALSE}
rpart.plot(rtree_fitCknn, shadow.col="gray", nn=TRUE)
```
### Decision Trees: Comparing Predictive Accuracy of Imputed and Complete Case Decision Tree Models on Complete Case Test Sets
The most accurate model for predicting split (divorced/separated) outcomes is the imputed decision tree model based on Ballot C. The model's overall accuracy rate in predicting marital status is 75.51%, with a 78.16% accuracy rate in predicting married statuses and a 60% accuracy rate in predicting splits.
Imputed Ballot C | predicted married | predicted split | accuracy
-----------------|---------------------|-----------------|----------
married |297 |83 | 0.78
split |26 |39 | 0.60
```{r test models on complete data, include=FALSE, cache=TRUE, message=FALSE, warning=FALSE}
#Ballot A
###Complete on Complete
rtree_fitAcc <-rpart(marital2 ~ year + sexeduc + region + premarsx + xmarsex + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + satfin + abnomore + absingle + divlaw + fefam, data = trainA, method = 'class', cp = 0.00739645)
printcp(rtree_fitAcc)
rpart.plot(rtree_fitAcc, shadow.col="gray", nn=TRUE)
trainApredcc <- predict(rtree_fitAcc, testA, type = 'class')
table(trainApredcc, testA$marital2)
##Overall Accuracy: (343+49)/(343+49+95+33) = 75.38%
##Married Accuracy: 343/(343+95) = 78.31%
##Split Accuracy: 49/82 = 59.76%
##Imputed on Complete
rtree_fitAknn <-rpart(marital2 ~ year + sexeduc + region + premarsx + xmarsex + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + satfin + abnomore + absingle + divlaw + fefam, data = trainAknn, method = 'class', cp = 0.00435920)
printcp(rtree_fitAknn)
rpart.plot(rtree_fitAknn, shadow.col="gray", nn=TRUE)
trainApredknnonc <- predict(rtree_fitAknn, testA, type = 'class')
table(trainApredknnonc, testA$marital2)
##Overall Accuracy: (349+40)/(349+104+27+40)=74.81%
##Married Accuracy: 349/(349+104)=77.04%
##Split Accuracy: 40/67 = 59.70%
#Ballot B
#Complete on Complete
rtree_fitBcc <-rpart(marital2 ~ year + sexeduc + region + premarsx + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + trust + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + helpful + fair + consci + satfin + divlaw + xmovie + fefam, data = trainB, method = 'class', cp = 0.00681431)
printcp(rtree_fitBcc)
rpart.plot(rtree_fitBcc, shadow.col="gray", nn=TRUE)
trainBpredcc <- predict(rtree_fitBcc, testB, type = 'class')
table(trainBpredcc, testB$marital2)
##Overall Accuracy: (274+30)/(274+30+29+94)=71.19%
##Married accuracy: (274)/(274+94) = 74.46%
##Split accuracy: 30/59 = 50.85%
##Imputed on Complete
rtree_fitBknn <-rpart(marital2 ~ year + sexeduc + region + premarsx + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + trust + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + helpful + fair + consci + satfin + divlaw + xmovie + fefam, data = trainBknn, method = 'class', cp = 0.00201845)
printcp(rtree_fitBknn)
rpart.plot(rtree_fitBknn, shadow.col="gray", nn=TRUE)
trainBpredknnonc <- predict(rtree_fitBknn, testB, type = 'class')
table(trainBpredknnonc, testB$marital2)
##Accuracy: (272+40)/(272+84+31+40) = 73.07%
##Married accuracy: 272/(272+84) = 76.40%
##Split accuracy: 40/71 = 56.34%
#Ballot C
#Complete on Complete
rtree_fitCcc <-rpart(marital2 ~ year + region + xmarsex + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + trust + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + helpful + fair + consci + satjob + satfin + abnomore + absingle + xmovie, data = trainC, method = 'class', cp = 0.00681431)
printcp(rtree_fitCcc)
rpart.plot(rtree_fitCcc, shadow.col="gray", nn=TRUE)
trainCpredcc <- predict(rtree_fitCcc, testC, type = 'class')
table(trainCpredcc, testC$marital2)
##Accuracy: (279+45)/(279+77+44+45)= 72.81%
##Married accuracy:279/(279+77)= 78.38%
##Split accuracy: 45/89= 50.56%
#Imputed on Complete
rtree_fitCknn <-rpart(marital2 ~ year + region + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + trust + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + helpful + fair + consci + satjob + satfin + abnomore + absingle, data = trainCknn, method = 'class', cp = 0.00435920)
printcp(rtree_fitCknn)
ImpTreeC <- rpart.plot(rtree_fitCknn, shadow.col="gray", nn=TRUE)
trainCpredknn <- predict(rtree_fitCknn, testC, type = 'class')
table(trainCpredknn, testC$marital2)
##Accuracy: (297+39)/(297+83+26+39)=75.51%
##Married accuracy:297/(297+83)=78.16%
##Split accuracy: 39/(26+39)=60%
```
## Analysis Method 3: Random Forests
### Random Forests with Complete Cases
We trained a set of random forest models on data from complete observations from Ballots A, B and C, and we then tested these models on their corresponding validation sets. The model based on the complete observations for Ballot C had an overall error rate of 23.79%, with an error in predicting married statuses of 5.64% vs. an error in predicting split statuses of 67.41%.
```{r rf complete cases, include=FALSE, cache=TRUE}
######Train-Test
## Train-validate-test split for GSS Ballot A
forestAcomplete <- randomForest(marital2 ~ year + sexeduc + region + premarsx + xmarsex + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + satfin + abnomore + absingle + divlaw + fefam, data = trainA)
summary(trainA)
##Measuring error rate:
forestAcomplete
## OOB estimate of error rate: 24.51%
#Variable importance
print(importance(forestAcomplete, type = 2))
#Most important variables: year, sex ed, region, premarsx, xmarsex, partyid, fundamentalism, children#,
#degree, age ...etc.
plot(forestAcomplete)
#Search for the optimal number of input features (mtry)
forestAcomplete.tune <- tuneRF(trainA[,-1], trainA$marital2, ntreeTry = 500,
mtryStart = 1, stepFactor = 2,
improve = 0.001, trace = TRUE, plot = TRUE)
##lowest OOB error: mtry = 4 with an OOB error of 24.99% --> thus, 4 variables split = optimal
forestAcomplete.tune <-randomForest(marital2 ~ year + sexeduc + region + premarsx + xmarsex + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + satfin + abnomore + absingle + divlaw + fefam,data=trainA, mtry=4, ntree=500,
keep.forest=TRUE, importance=TRUE) # randomForest() has no 'test' argument; validation-set predictions are made below with predict()
forestAcomplete.tune
#OOB estimate of error rate is 25.21%; married = 6.39% vs. split: 68.34%
#Predict values for train set
pred.rf.trainAcomplete <- predict(forestAcomplete.tune, newdata = trainA, type='prob')[,2]
#Predict values for validation set
pred.rf.validAcomplete <- predict(forestAcomplete, validA, type='prob')[,2]
#Set up ROC inputs
input.rfAcomplete <- rbind(data.frame(model = "trainAcomplete", d = trainA$marital2, m = pred.rf.trainAcomplete),
data.frame(model = "validAcomplete", d = validA$marital2, m = pred.rf.validAcomplete))
#Graph train and validation ROCs
roc.rfAcomplete <- ggplot(input.rfAcomplete, aes(d = d, model = model, m = m, colour = model)) +
geom_roc(show.legend = TRUE) + style_roc() +ggtitle("TrainAcomplete")
#AUC
calc_auc(roc.rfAcomplete)
importance(forestAcomplete)
varImpPlot(forestAcomplete)
forestBcomplete <- randomForest(marital2 ~ year + sexeduc + region + premarsx + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + trust + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + helpful + fair + consci + satfin + divlaw + xmovie + fefam, data = trainB)
##Measuring error rate:
forestBcomplete
## OOB estimate of error rate: 26.85%
##6.23% error rate with guessing married, 72.6% error rate with guessing split
#Variable importance
print(importance(forestBcomplete, type = 2))
#Variable importance rankings for this model are in the importance() output above
plot(forestBcomplete)
#Search for the optimal number of input features (mtry)
forestBcomplete.tune <- tuneRF(trainB[,-1], trainB$marital2, ntreeTry = 500,
mtryStart = 1, stepFactor = 2,
improve = 0.001, trace = TRUE, plot = TRUE)
##lowest OOB error: mtry = 16 with an OOB error of 25.74% --> thus, 16 variables split = optimal
forestBcomplete.tune <-randomForest(marital2 ~ year + sexeduc + region + premarsx + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + trust + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + helpful + fair + consci + satfin + divlaw + xmovie + fefam,data=trainB, mtry=16, ntree=500,
keep.forest=TRUE, importance=TRUE) # randomForest() has no 'test' argument; validation-set (not test-set) predictions are made below with predict()
forestBcomplete.tune
##OOB estimate of error rate is 27.01%; married = 8.92% vs. split: 67.12%
#Predict values for train set
pred.rf.trainBcomplete <- predict(forestBcomplete.tune, newdata = trainB, type='prob')[,2]
#Predict values for validating set
pred.rf.validBcomplete <- predict(forestBcomplete, validB, type='prob')[,2]
#Set up ROC inputs
input.rfBcomplete <- rbind(data.frame(model = "trainB", d = trainB$marital2, m = pred.rf.trainBcomplete),
data.frame(model = "validB", d = validB$marital2, m = pred.rf.validBcomplete))
#Graph train and validation ROCs
roc.rfBcomplete <- ggplot(input.rfBcomplete, aes(d = d, model = model, m = m, colour = model)) +
geom_roc(show.legend = TRUE) + style_roc() +ggtitle("TrainB")
#AUC
calc_auc(roc.rfBcomplete)
importance(forestBcomplete)
varImpPlot(forestBcomplete)
forestCcomplete <- randomForest(marital2 ~ year + region + xmarsex + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + trust + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + helpful + fair + consci + satjob + satfin + abnomore + absingle +xmovie, data = trainC)
##Measuring error rate:
forestCcomplete
## OOB estimate of error rate: 30.04%
##9.82% error rate with guessing married, 67.78% error rate with guessing split
#Variable importance
print(importance(forestCcomplete, type = 2))
#Variable importance rankings for this model are in the importance() output above
plot(forestCcomplete)
#Search for the optimal number of input features (tuneRF began throwing an error, but previous runs and partial results give mtry = 4)
#forestCcomplete.tune <- tuneRF(trainC[,-1], trainC$marital2, ntreeTry = 500,
#mtryStart = 1, stepFactor = 2,
#improve = 0.001, trace = TRUE, plot = TRUE)
##lowest OOB error: mtry = 4 with an OOB error of 28.1% --> thus, 4 variables split = optimal
forestCcomplete.tune <-randomForest(marital2 ~ year + region + xmarsex + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + trust + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + helpful + fair + consci + satjob + satfin + abnomore + absingle + xmovie,data=trainC, mtry=4, ntree=500,
keep.forest=TRUE, importance=TRUE) # randomForest() has no 'test' argument; validation-set predictions are made below with predict()
forestCcomplete.tune
##OOB estimate of error rate is 29.84%; married = 8.63% vs. split: 69.44%
#Predict values for train set
pred.rf.trainCcomplete <- predict(forestCcomplete.tune, newdata = trainC, type='prob')[,2]
#Predict values for validation set
pred.rf.validCcomplete <- predict(forestCcomplete, validC, type='prob')[,2]
#Set up ROC inputs
input.rfCcomplete <- rbind(data.frame(model = "trainCcomplete", d = trainC$marital2, m = pred.rf.trainCcomplete),
data.frame(model = "validC", d = validC$marital2, m = pred.rf.validCcomplete))
#Graph train and validation ROCs
roc.rfCcomplete <- ggplot(input.rfCcomplete, aes(d = d, model = model, m = m, colour = model)) +
geom_roc(show.legend = TRUE) + style_roc() +ggtitle("TrainCcomplete")
#AUC
calc_auc(roc.rfCcomplete)
importance(forestCcomplete)
varImpPlot(forestCcomplete)
```
### Random Forests with Imputed Values
Just as with the complete-case models, we trained random forest models on imputed data for Ballots A, B and C and tested them on the corresponding validation sets. Again, the model based on imputed Ballot C data was most accurate: it had an overall error rate of 24.6%, with a 5.5% error rate for married respondents and a 70.5% error rate for split respondents.
```{r rf imputed, include=FALSE, cache=TRUE}
####Trying Random Forest with condensed variables
forestA <- randomForest(marital2 ~ year + sexeduc + region + premarsx + xmarsex + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + satfin + abnomore + absingle + divlaw + fefam, data = trainAknn)
summary(trainAknn)
##Measuring error rate:
forestA
## OOB estimate of error rate: 24.51%
#Variable importance
print(importance(forestA, type = 2))
#Most important variables: year, sex ed, region, premarsx, xmarsex, partyid, fundamentalism, children#,
#degree, age ...etc.
plot(forestA)
#Search for the optimal number of input features (mtry)
forestA.tune <- tuneRF(trainAknn[,-1], trainAknn$marital2, ntreeTry = 500,
mtryStart = 1, stepFactor = 2,
improve = 0.001, trace = TRUE, plot = TRUE)
##lowest OOB error: mtry = 8 with an OOB error of 23.94% --> thus, 8 variables split = optimal
forestA.tune <-randomForest(marital2 ~ year + sexeduc + region + premarsx + xmarsex + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + satfin + abnomore + absingle + divlaw + fefam,data=trainAknn, mtry=8, ntree=500,
keep.forest=TRUE, importance=TRUE) # randomForest() has no 'test' argument; test-set predictions are made below with predict()
forestA.tune
##OOB estimate of error rate is 24.15%; married = 6.15% vs. split: 65.65%
#Predict values for train set
pred.rf.trainA <- predict(forestA.tune, newdata = trainAknn, type='prob')[,2]
#Predict values for test set
pred.rf.testA <- predict(forestA, testAknn, type='prob')[,2]
#Set up ROC inputs
input.rfA <- rbind(data.frame(model = "trainA", d = trainAknn$marital2, m = pred.rf.trainA),
data.frame(model = "testA", d = testAknn$marital2, m = pred.rf.testA))
#Graph train and test ROCs
roc.rfA <- ggplot(input.rfA, aes(d = d, model = model, m = m, colour = model)) +
geom_roc(show.legend = TRUE) + style_roc() +ggtitle("TrainA")
#AUC
calc_auc(roc.rfA)
importance(forestA)
varImpPlot(forestA)
## Ballot B
forestB <- randomForest(marital2 ~ year + sexeduc + region + premarsx + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + trust + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + helpful + fair + consci + satfin + divlaw + xmovie + fefam, data = trainBknn)
summary(trainBknn)
##Measuring error rate:
forestB
## OOB estimate of error rate: 25.57%
##6.13% error rate with guessing married, 73% error rate with guessing split
#Variable importance
print(importance(forestB, type = 2))
#Variable importance rankings for this model are in the importance() output above
plot(forestB)
#Search for the optimal number of input features (mtry)
#forestB.tune <- tuneRF(trainBknn[,-1], trainBknn$marital2, ntreeTry = 500,
#mtryStart = 1, stepFactor = 2,
#improve = 0.001, trace = TRUE, plot = TRUE)
##lowest OOB error: mtry = 16 with an OOB error of 18.88% --> thus, 16 variables split = optimal
forestB.tune <-randomForest(marital2 ~ year + sexeduc + region + premarsx + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + trust + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + helpful + fair + consci + satfin + divlaw + xmovie + fefam,data=trainBknn, mtry=16, ntree=500,
keep.forest=TRUE, importance=TRUE) # randomForest() has no 'test' argument; test-set predictions are made below with predict()
forestB.tune
##OOB estimate of error rate is 25.19%; married = 7.37% vs. split: 68.69%
#plotROC
#Predict values for train set
pred.rf.trainB <- predict(forestB.tune, newdata = trainBknn, type='prob')[,2]
#Predict values for test set
pred.rf.testB <- predict(forestB, testBknn, type='prob')[,2]
#Set up ROC inputs
input.rfB <- rbind(data.frame(model = "trainB", d = trainBknn$marital2, m = pred.rf.trainB),
data.frame(model = "testB", d = testBknn$marital2, m = pred.rf.testB))
#Graph train and test ROCs
roc.rfB <- ggplot(input.rfB, aes(d = d, model = model, m = m, colour = model)) +
geom_roc(show.legend = TRUE) + style_roc() +ggtitle("TrainB")
#AUC
calc_auc(roc.rfB)
importance(forestB)
varImpPlot(forestB)
## Ballot C
forestC <- randomForest(marital2 ~ year + region + xmarsex + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + trust + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + helpful + fair + consci + satjob + satfin + abnomore + absingle + xmovie, data = trainCknn)
summary(trainCknn)
##Measuring error rate:
forestC
## OOB estimate of error rate: 24.09%
##5.42% error rate with guessing married, 69.95% error rate with guessing split
#Variable importance
print(importance(forestC, type = 2))
#Variable importance rankings for this model are in the importance() output above
plot(forestC)
#Search for the optimal number of input features (mtry)
#forestC.tune <- tuneRF(trainCknn[,-1], trainCknn$marital2, ntreeTry = 500,
#mtryStart = 1, stepFactor = 2,
#improve = 0.001, trace = TRUE, plot = TRUE)
##lowest OOB error: mtry = 8 with an OOB error of 14.3% --> thus, 8 variables split = optimal
forestC.tune <-randomForest(marital2 ~ year + region + xmarsex + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + trust + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + helpful + fair + consci + satjob + satfin + abnomore + absingle + xmovie,data=trainCknn, mtry=8, ntree=500,
keep.forest=TRUE, importance=TRUE) # randomForest() has no 'test' argument; test-set predictions are made below with predict()
forestC.tune
##OOB estimate of error rate is 23.27%; married = 6.30% vs. split: 64.94%
#Predict values for train set
pred.rf.trainC <- predict(forestC.tune, newdata = trainCknn, type='prob')[,2]
#Predict values for test set
pred.rf.testC <- predict(forestC, testCknn, type='prob')[,2]
#Set up ROC inputs
input.rfC <- rbind(data.frame(model = "trainC", d = trainCknn$marital2, m = pred.rf.trainC),
data.frame(model = "testC", d = testCknn$marital2, m = pred.rf.testC))
#Graph train and test ROCs
roc.rfC <- ggplot(input.rfC, aes(d = d, model = model, m = m, colour = model)) +
geom_roc(show.legend = TRUE) + style_roc() +ggtitle("TrainC")
#AUC
calc_auc(roc.rfC)
importance(forestC)
varImpPlot(forestC)
#Source: http://dataaspirant.com/2017/01/02/k-nearest-neighbor-classifier-implementation-r-scratch/
```
### Random Forests: Comparing Results
Finally, we compared our random forest models by testing them on the test (rather than validation) sets corresponding to the complete observations for Ballots A, B and C. Here, the random forest model trained on imputed data for Ballot C was most accurate, with an overall error rate of 23.87% (roughly 76% overall accuracy). It predicted married respondents with 95.5% accuracy (a 4.5% error rate) but split respondents with only 29.5% accuracy (a 70.5% error rate). From these results, this model could be a useful tool for predicting whether a respondent is married (and perhaps will be or stay married), but it has little power with respect to divorce.
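The gap between strong overall accuracy and weak split-class accuracy reflects class imbalance: overall accuracy is the prevalence-weighted average of the class-wise accuracies. As an added back-of-envelope check (ours, not part of the original analysis), solving that relationship using the error rates quoted above recovers the implied share of married respondents in the test set:

```{r implied class balance}
# Sketch using the rates quoted above (4.5% married error, 70.5% split error,
# 23.87% overall error). Solving
#   overall = p * err_married + (1 - p) * err_split
# for p gives the implied proportion of married respondents.
err_married <- 0.045
err_split   <- 0.705
err_overall <- 0.2387
p_married <- (err_split - err_overall) / (err_split - err_married)
p_married  # about 0.71, i.e. roughly 71% of test respondents are married
```

With roughly seven in ten respondents married, a model that is excellent on the married class can post a low overall error rate even while missing most split respondents.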
```{r rf complete model on test data, include=FALSE, cache=TRUE}
##Complete on Complete (use 'test' rather than 'validate')
## Train-validate-test split for GSS Ballot A
forestAcomplete.tune <-randomForest(marital2 ~ year + sexeduc + region + premarsx + xmarsex + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + satfin + abnomore + absingle + divlaw + fefam,data=trainA, mtry=4, ntree=500,
keep.forest=TRUE, importance=TRUE) # randomForest() silently ignores the unknown 'test' argument; the rates below are OOB estimates
forestAcomplete.tune
##OOB estimate of error rate is 25.17%; married = 6.52% vs. split: 67.9%
##Complete on Complete
## test for GSS Ballot B
forestBcomplete.tune <-randomForest(marital2 ~ year + sexeduc + region + premarsx + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + trust + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + helpful + fair + consci + satfin + divlaw + xmovie + fefam,data=trainB, mtry=16, ntree=500,
keep.forest=TRUE, importance=TRUE) # randomForest() silently ignores the unknown 'test' argument; the rates below are OOB estimates
forestBcomplete.tune
##OOB estimate of error rate is 26.17%; married = 8.2% vs. split: 65.93%
##Ballot C
forestCcomplete.tune <-randomForest(marital2 ~ year + region + xmarsex + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + trust + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + helpful + fair + consci + satjob + satfin + abnomore + absingle + xmovie,data=trainC, mtry=4, ntree=500,
keep.forest=TRUE, importance=TRUE) # randomForest() silently ignores the unknown 'test' argument; the rates below are OOB estimates
forestCcomplete.tune
##OOB estimate of error rate is 29.84%; married = 8.63% vs. split: 69.44%
```
```{r rf imputed model on completedata, include=FALSE, cache=TRUE}
##Imputed on Complete
###BallotA
forestAimponcomp.tune <-randomForest(marital2 ~ year + sexeduc + region + premarsx + xmarsex + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + satfin + abnomore + absingle + divlaw + fefam,data=trainAknn, mtry=8, ntree=500,
keep.forest=TRUE, importance=TRUE) # randomForest() silently ignores the unknown 'test' argument; the rates below are OOB estimates
forestAimponcomp.tune
###Trained on imputed dataset, tested on complete dataset
##OOB estimate of error rate is 23.97%; married = 7.2% vs. split: 64.3%
##Imputed (train on imputed trainAknn and tested on testA) is more accurate for Ballot A
###BallotB
forestBimponcomp.tune <-randomForest(marital2 ~ year + sexeduc + region + premarsx + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + trust + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + helpful + fair + consci + satfin + divlaw + xmovie + fefam,data=trainBknn, mtry=16, ntree=500,
keep.forest=TRUE, importance=TRUE) # randomForest() silently ignores the unknown 'test' argument; the rates below are OOB estimates
forestBimponcomp.tune
##OOB estimate of error rate is 25.4%; married = 7.62% vs. split: 68.77%
##Imputed is more accurate overall for Ballot B, but for the split category specifically, forestBcomplete.tune (trained on complete cases) is more accurate
###BallotC
forestCimponcomp.tune <-randomForest(marital2 ~ year + region + xmarsex + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + trust + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + helpful + fair + consci + satjob + satfin + abnomore + absingle + xmovie,data=trainCknn, mtry=4, ntree=500,
keep.forest=TRUE, importance=TRUE) # randomForest() silently ignores the unknown 'test' argument; the rates below are OOB estimates
forestCimponcomp.tune
##OOB estimate of error rate: 23.92%: married = 4.78%; split: 70.9%
##Imputed is more accurate overall for Ballot C, but for the split category specifically, training on complete cases (forestCcomplete.tune) is more accurate
```
## Analysis Method Four: Clusters
We also tried analyzing the data using clustering methods. Though the initial tests to determine the optimal number of clusters (2) seemed promising, the clusters did not accurately separate married from split respondents. Clustering was ineffective regardless of the algorithm and distance or linkage method used -- kmeans, PAM (partitioning around medoids), hierarchical clustering with Ward, single, complete, or maximum linkage, Manhattan distance, etc.
```{r cluster setup, include=FALSE, cache = TRUE }
#using packages factoextra and NbClust and Cluster
# create binary matrix
gss.a.mat <- model.matrix(~0 + year + sexeduc + region + premarsx + xmarsex + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + satfin + abnomore + absingle + divlaw, GSS.A.comp)
# now with B ballot
gss.b.mat <- model.matrix(~0 + year + sexeduc + region + premarsx + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + trust + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + helpful + fair + consci + satjob + satfin + xmovie + divlaw + fefam, GSS.B.comp)
# now with Ballot C
gss.c.mat <- model.matrix(~0 + year + region + xmarsex + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + trust + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + helpful + fair + consci + satjob + satfin + abnomore + absingle + xmovie, GSS.C.comp)
```
``` {r cluster A, include=FALSE, cache=TRUE}
#determine optimal number of clusters using method within-cluster sum of squares
# visualize as line ("elbow" method)
elbow.a <- fviz_nbclust(gss.a.mat, kmeans, method = "wss")
# (silhouette method)
sil.plot.a <- fviz_nbclust(gss.a.mat, kmeans, method = "silhouette")
# optimal number of clusters is 2
# compute kmeans clusters
km.a <- kmeans(gss.a.mat, 2, 10)
# compute hierarchical clusters and check silhouette for validation
hc.a <- eclust(gss.a.mat, "hclust", k=2, method="ward.D2", graph=FALSE)
sil.a <- hc.a$silinfo
sil.a$clus.avg.widths
# not very good 0.44/0.42
# compute PAM clusters
pam.a <- pam(gss.a.mat, 2)
# visualize as silhouette plot
fviz_silhouette(pam.a)
# similar to hierarchical
#compare clusters to marital status
table(GSS.A.comp$marital2, km.a$cluster)
table(GSS.A.comp$marital2, hc.a$cluster)
table(GSS.A.comp$marital2, pam.a$cluster)
```
For example, using the data from Ballot A, an elbow plot and a silhouette plot both indicate that the number of clusters should be 2.
```{r cluster output A, echo=FALSE, warning=FALSE, message=FALSE}
grid.arrange(elbow.a, sil.plot.a, ncol=2)
```
However, none of the kmeans, PAM, or hierarchical clustering methods correctly categorizes respondents as married or split. The following table shows the "accuracy" calculated as the number of married observations assigned to cluster 1 plus the number of split observations assigned to cluster 2, divided by the total number of observations.
```{r cluster B and C, include=FALSE, cache=TRUE}
#determine optimal number of clusters using method within-cluster sum of squares
# visualize as line ("elbow" method)
elbow.b <- fviz_nbclust(gss.b.mat, kmeans, method = "wss")
# (silhouette method)
sil.plot.b <- fviz_nbclust(gss.b.mat, kmeans, method = "silhouette")
# optimal number of clusters is 2
# compute kmeans clusters
km.b <- kmeans(gss.b.mat, 2, 10)
# compute hierarchical clusters and check silhouette for validation
hc.b <- eclust(gss.b.mat, "hclust", k=2, method="complete", graph=FALSE)
sil.b <- hc.b$silinfo
sil.b$clus.avg.widths
# not very good ~0.47/~0.38
# compute PAM clusters
pam.b <- pam(gss.b.mat, 2)
# visualize as silhouette plot
fviz_silhouette(pam.b)
# similar to hierarchical
#compare clusters to marital status
table(GSS.B.comp$marital2, km.b$cluster)
table(GSS.B.comp$marital2, hc.b$cluster)
table(GSS.B.comp$marital2, pam.b$cluster)
#determine optimal number of clusters using method within-cluster sum of squares
# visualize as line ("elbow" method)
fviz_nbclust(gss.c.mat, kmeans, method = "wss")
# (silhouette method)
fviz_nbclust(gss.c.mat, kmeans, method = "silhouette")
# optimal number of clusters is 2
# compute kmeans clusters
km.c <- kmeans(gss.c.mat, 2, 10)
# compute hierarchical clusters and check silhouette for validation
hc.c <- eclust(gss.c.mat, "hclust", k=2, method="complete", graph=FALSE)
sil.c <- hc.c$silinfo
sil.c$clus.avg.widths
# not very good ~0.28/~0.41
# compute PAM clusters
pam.c <- pam(gss.c.mat, 2)
# visualize as silhouette plot
fviz_silhouette(pam.c)
# similar to hierarchical
#compare clusters to marital status
table(GSS.C.comp$marital2, km.c$cluster)
table(GSS.C.comp$marital2, hc.c$cluster)
table(GSS.C.comp$marital2, pam.c$cluster)
```
Method | Overall accuracy
-------------|------------------
kmeans | 0.55
hierarchical | 0.48
PAM | 0.48
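Because cluster labels are arbitrary, a fairer score takes the better of the two possible label-to-class assignments. A small helper (our illustrative addition, not in the original chunks) that applies this idea to a `table(class, cluster)` cross-tabulation:

```{r cluster accuracy helper}
# Helper (illustrative): score a 2-cluster solution against a binary class
# label under the better of the two possible cluster-label assignments.
cluster_accuracy <- function(tab) {
  max(tab[1, 1] + tab[2, 2], tab[1, 2] + tab[2, 1]) / sum(tab)
}
# e.g. cluster_accuracy(table(GSS.A.comp$marital2, km.a$cluster))
```

Even under this more generous score, a fixed-assignment accuracy of 0.48 only becomes 0.52, still barely better than chance.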
As another check, the average silhouette widths for the hierarchical clusters were calculated, but they were only 0.44 and 0.42 (values close to 1 are ideal). The results for Ballots B and C were no more accurate, indicating that clustering is not an effective method for this analysis.
## Conclusion
In the end, we selected two different models as the best predictors. For predicting the likelihood of being married, the random forest model trained on imputed data for Ballot C was most accurate. For predicting the likelihood of being divorced, the decision tree model trained on imputed data for Ballot C was most accurate.
Our analysis has limitations in both its accuracy and its usefulness to policymakers or marriage counselors. Its accuracy is limited because we only use data from one half of each couple; our predictions would likely be much more accurate with data on both partners. Its usefulness may be limited in application to younger generations if their patterns of marriage and divorce differ significantly from those of the people in our dataset. Despite this, our analysis provides some insight into the characteristics related to an individual's likelihood of getting a divorce.