---
title: "Predicting Divorce"
author: "Celeste Chen and Sarah Unbehaun"
date: "May 8, 2017"
output:
pdf_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(gridExtra)
library(plyr)
library(rpart)
library(rpart.plot)
library(Hmisc)
library(DMwR)
library(randomForest)
library(plotROC)
library(factoextra)
library(NbClust)
library(cluster)
```
```{r data, include=FALSE}
load("GSSdivorce.rda")
```
## Project Overview
Given the evolving discourse and perspectives on marriage, we seek to predict the probability that a couple will divorce based on factors such as social views; attitudes about pornography, premarital sex, or extramarital sex; and fundamentalism of religious views. Using data from the General Social Survey, we will train an algorithm on a generational cohort from the years 1996-2014 to determine which types of respondents are most likely to experience divorce. We will test four methods of analyzing the data for predictions: (1) logistic regression, (2) decision trees, (3) random forests, and (4) clustering. Based on the accuracy results of each method, we will determine the best model for prediction.
## Data
Our dataset is taken from the General Social Survey (GSS) and focuses on observations from 1996 to 2014. According to the [GSS website](http://gss.norc.org/About-The-GSS):
>The GSS contains a standard core of demographic, behavioral, and attitudinal questions, plus topics of special interest. Among the topics covered are civil liberties, crime and violence, intergroup tolerance, morality, national spending priorities, psychological well-being, social mobility, and stress and traumatic events. Altogether the GSS is the single best source for sociological and attitudinal trend data covering the United States.
We have extracted variables on the following characteristics for our analysis: opinions on sex education in the public schools; region of interview; opinions on premarital sex; opinions on extramarital sex; views on pornography laws; marital status; political party ID; how fundamentalist the respondent is; number of children; age of respondent; ever been widowed; highest educational degree achieved; respondent's labor force status; whether the respondent thinks of self as liberal or conservative; general self-reported happiness; whether people can be trusted; self-reported class; self-reported income; region at age 16; who the respondent lived with at age 16; religion at age 16; frequency of church attendance; whether the respondent believes other people would try to take advantage or be fair; whether the respondent believes other people usually try to be helpful; size of city or town of residence; confidence in the scientific community; self-reported job satisfaction; self-reported satisfaction with financial situation; whether abortions should be legal for married women who don't want more children; whether abortions should be legal if the woman is single and doesn't want to marry; and whether the respondent has seen an X-rated movie in the last year.
Not all of these questions were asked of all respondents. Three ballots were administered, so our data had to be analyzed in groups corresponding to each ballot. We also restricted our data to those who were married or "split," split meaning either divorced or separated. The number of observations in each category is shown below:
Ballot | A | B | C | Total
--------|------|------|------|-------
Married | 4008 | 4025 | 4054 | 12853
Split | 1607 | 1665 | 1667 | 5238
Total | 5615 | 5700 | 5721 | 18091
```{r missing, include=FALSE, cache=TRUE}
#Restrict to 1996-2014 and married, divorced or separated
GSS.divorce <- GSS.divorce[GSS.divorce$year>=1996, ]
GSS.divorce <- GSS.divorce[GSS.divorce$marital=="MARRIED" | GSS.divorce$marital=="DIVORCED" | GSS.divorce$marital=="SEPARATED", ]
GSS.divorce$marital <- factor(GSS.divorce$marital)
#Collapse marital status to married or split
GSS.divorce$marital2 <- "married"
GSS.divorce$marital2 [GSS.divorce$marital == "SEPARATED"] <- "split"
GSS.divorce$marital2 [GSS.divorce$marital == "DIVORCED"] <- "split"
GSS.divorce$marital2 [is.na(GSS.divorce$marital) ==T] <- NA
GSS.divorce$marital2 <- factor(GSS.divorce$marital2)
GSSnew <- GSS.divorce[!is.na(GSS.divorce$marital2),]
summary(GSSnew$marital2)
table(GSSnew$marital2, GSSnew$ballot)
#Create categorical variable for size of place
GSSnew$sizecat <- NA
GSSnew$sizecat [GSSnew$size < 10] <- "rural"
GSSnew$sizecat [GSSnew$size >=10 & GSSnew$size <= 99] <- "small"
GSSnew$sizecat [GSSnew$size >=100 & GSSnew$size <= 999] <- "medium"
GSSnew$sizecat [GSSnew$size >=1000 & GSSnew$size <= 9999] <- "large"
GSSnew$sizecat <- factor(GSSnew$sizecat)
str(GSSnew)
```
```{r visuals test,include=FALSE}
##Opinion on Premarital Sex
#Married vs Divorced vs Separated vs Other
tab <- as.data.frame(table(GSSnew$marital2, GSSnew$premarsx))
colnames(tab) <- c("status", "premarital", "count")
pie_1b <- ggplot(tab, aes(status, premarital)) +
geom_point(aes(size = count), colour = "navy") +
theme_bw() + xlab("Marital Status?") + ylab("Opinion on premarital sex?") + scale_size_continuous(range=c(3,15)) + theme(plot.title = element_text(size = 10), legend.position="none") + coord_fixed(ratio = 1)
##How happy are you?
tab <- as.data.frame(table(GSSnew$marital2, GSSnew$happy))
colnames(tab) <- c("status", "happy", "count")
pie_5b <- ggplot(tab, aes(status, happy)) +
geom_point(aes(size = count), colour = "navy") +
theme_bw() + xlab("Marital Status?") + ylab("How happy are you?") + scale_size_continuous(range=c(3,15)) + theme(plot.title = element_text(size = 10), legend.position="none") + coord_fixed(ratio = 1)
#Sex Education in Schools
SexEducGraph2 <- ggplot(data=subset(GSSnew, !is.na(sexeduc)), aes(x= sexeduc, group=marital2)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
geom_text(aes( label = scales::percent(..prop..),
y= ..prop.. ), stat= "count", vjust = -.5) +
labs(y = "Percent", fill="sexeduc") +
facet_grid(~marital2) +
scale_y_continuous(labels = scales::percent)
##Can people be trusted?
TrustGraph2 <- ggplot(data=subset(GSSnew, !is.na(trust)), aes(x= trust, group=marital2)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
geom_text(aes( label = scales::percent(..prop..),
y= ..prop.. ), stat= "count", vjust = -.5) +
labs(y = "Percent", fill="trust") +
facet_grid(~marital2) + scale_y_continuous(labels = scales::percent) + theme(axis.text.x=element_text(angle = -90, hjust = 0))
##Live with both parents at age 16
tab <- as.data.frame(table(GSSnew$marital2, GSSnew$family16))
colnames(tab) <- c("status", "family16", "count")
pie_21b <- ggplot(tab, aes(status, family16)) +
geom_point(aes(size = count), colour = "navy") +
theme_bw() + xlab("Marital Status?") + ylab("Live with both parents at age 16?") + scale_size_continuous(range=c(3,15)) + theme(plot.title = element_text(size = 10), legend.position="none") + coord_fixed(ratio = 1)
## rescale to both parents, parent/stepparent, other???
familyGraph2 <- ggplot(GSSnew, aes(x= family16, group=marital2)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
geom_text(aes( label = scales::percent(..prop..),
y= ..prop.. ), stat= "count", vjust = -.5) +
labs(y = "Percent", fill="family16") +
facet_grid(~marital2) +
scale_y_continuous(labels = scales::percent) + theme(axis.text.x=element_text(angle = -90, hjust = 0))
```
A few examples of the differences between the married and split groups can be seen in the graphs below. First, respondents who are split appear slightly more likely to say that premarital sex is not wrong at all. They are also significantly less likely to say that they are very happy.
``` {r visuals, echo=FALSE, fig.height=4}
grid.arrange(pie_1b, pie_5b, ncol=2)
```
Also, respondents who are split are more likely to say that others cannot be trusted and slightly more likely to support sexual education in schools.
``` {r visuals 2, echo=FALSE, fig.height=4}
TrustGraph2
SexEducGraph2
```
``` {r splits, include=FALSE, cache=TRUE }
## Separate GSSnew by ballot type A, B or C
GSS.A <- subset(GSSnew, (GSSnew$version==1|GSSnew$version==4))
GSS.A <- subset(GSS.A, select = c(marital2, year, sexeduc, region, premarsx, xmarsex, partyid, fund, childs, degree, Agecat1, widowed, wrkstat, polviews, happy, class, income, reg16, family16, born, parborn, sizecat, attend, relig16, bible, satfin, abnomore, absingle, divlaw, fefam))
GSS.A.comp <- GSS.A[complete.cases(GSS.A),]
GSS.B <- subset(GSSnew, (GSSnew$version==2|GSSnew$version==5))
GSS.B <- subset(GSS.B, select = c(marital2, year, sexeduc, premarsx, region, pornlaw, partyid, fund, childs, degree, Agecat1, widowed, wrkstat, polviews, happy, trust, class, income, reg16, family16, born, parborn, sizecat, attend, relig16, bible, divlaw, xmovie, satfin, helpful, fair, satjob, consci, fefam))
GSS.B.comp <- GSS.B[complete.cases(GSS.B),]
GSS.C <- subset(GSSnew, (GSSnew$version==3|GSSnew$version==6))
GSS.C <- subset(GSS.C, select = c(year, region, xmarsex, pornlaw, partyid, fund, childs, degree, Agecat1, widowed, wrkstat, polviews, happy, trust, class, reg16, family16, born, parborn, income, fair, helpful, sizecat, attend, relig16, consci, satjob, satfin, abnomore, absingle, xmovie, marital2))
GSS.C.comp <- GSS.C[complete.cases(GSS.C),]
## Train-validate-test split for GSS Ballot A
set.seed(567)
rand <- runif(nrow(GSS.A.comp))
trainA <- GSS.A.comp[rand>0.3,]
validA <- GSS.A.comp[rand>0.15 & rand <0.3,]
testA <- GSS.A.comp[rand<0.15,]
## Train-validate-test split for GSS Ballot B
set.seed(678)
rand <- runif(nrow(GSS.B.comp))
trainB <- GSS.B.comp[rand>0.3,]
validB <- GSS.B.comp[rand>0.15 & rand <0.3,]
testB <- GSS.B.comp[rand<0.15,]
## Train-validate-test split for GSS Ballot C
set.seed(789)
rand <- runif(nrow(GSS.C.comp))
trainC <- GSS.C.comp[rand>0.3,]
validC <- GSS.C.comp[rand>0.15 & rand <0.3,]
testC <- GSS.C.comp[rand<0.15,]
```
## Analysis Method 1: Logistic Regression
We first tried logistic regression models, using all relevant variables for each ballot group as well as models using subsets of variables: demographics only, social/political views only, and only the variables that were statistically significant in the complete regression. Each ballot group was separated into train, validate, and test sets. The overall accuracy of each model on the test data, calculated as the number of correct predictions divided by the total number of observations, is reported below. Accuracy was fairly similar across models.
Ballot | Model | Overall Accuracy
-------|---------------|-----------------
A | all variables | 0.75
A | demographics | 0.76
B | all variables | 0.74
B | stat. signif. | 0.74
C | all variables | 0.72
C | pol/social | 0.71
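The accuracy figures above are computed by hand in the hidden chunks from 2x2 confusion tables. A small helper (not part of the original analysis) makes the calculation explicit; the counts below are taken from the Ballot A demographics-only test-set comments in the chunk that follows:

```r
# Hypothetical helper: overall and per-class accuracy from a confusion table
# whose rows are actual statuses and whose columns are predicted statuses.
accuracy_summary <- function(conf) {
  list(overall   = sum(diag(conf)) / sum(conf),   # correct predictions / total
       per_class = diag(conf) / rowSums(conf))    # accuracy within each actual class
}

# Counts from the Ballot A demographics-only model on the test set
conf <- matrix(c(345, 31,
                  96, 48),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual    = c("married", "split"),
                               predicted = c("married", "split")))
accuracy_summary(conf)$overall  # roughly 0.76, matching the table above
```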
``` {r logit, include=FALSE, cache=TRUE}
###Logistic Regression
#glm(marital2 ~ <x>, data = <data>, family = binomial())
fitA <- glm(marital2 ~ year + sexeduc + region + premarsx + xmarsex + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + satfin + abnomore + absingle + divlaw, family=binomial(link='logit'),data=trainA)
summary(fitA)
trainA$pred <- fitA$fitted.values
trainA$pred <- ifelse(trainA$pred > 0.5,1,0)
table(trainA$marital2, trainA$pred)
# accuracy overall (1547 + 331)/(1547+124+345+331) = .80
validA$pred <- predict(fitA, newdata = validA, type = "response")
validA$pred <- ifelse(validA$pred > 0.5,1,0)
table(validA$marital2, validA$pred)
# accuracy overall (304 + 69)/(304+42+90+69) = .74
# Test A with just demographic variables
fitA2 <- glm(marital2 ~ year + region + childs + degree + Agecat1 + widowed + wrkstat + class + income + family16 + born + parborn + sizecat, family=binomial(link='logit'),data=trainA)
trainA$pred2 <- fitA2$fitted.values
trainA$pred2 <- ifelse(trainA$pred2 > 0.5,1,0)
table(trainA$marital2, trainA$pred2)
# accuracy overall (1561 + 266)/(1561+110+410+266) = .78
validA$pred2 <- predict(fitA2, newdata = validA, type = "response")
validA$pred2 <- ifelse(validA$pred2 > 0.5,1,0)
table(validA$marital2, validA$pred2)
# accuracy (320+56)/(320+26+1-3+56) = .94
# significant increase is most likely a fluke, run on test data:
testA$pred2 <- predict(fitA2, newdata = testA, type = "response")
testA$pred2 <- ifelse(testA$pred2 > 0.5,1,0)
table(testA$marital2, testA$pred2)
# accuracy (345+48)/(345+31+96+48) = 0.76
# compare with running fitA on testA
testA$pred <- predict(fitA, newdata = testA, type = "response")
testA$pred <- ifelse(testA$pred > 0.5,1,0)
table(testA$marital2, testA$pred)
# accuracy = 0.75
## Ballot B
fitB <- glm(marital2 ~ year + sexeduc + region + premarsx + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + trust + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + helpful + fair + consci + satjob + satfin + xmovie + divlaw + fefam, family=binomial(link='logit'),data=trainB)
summary(fitB)
trainB$pred <- fitB$fitted.values
trainB$pred <- ifelse(trainB$pred > 0.5,1,0)
table(trainB$marital2, trainB$pred)
# accuracy overall (1293 + 266)/(1293+114+338+243) = .78
validB$pred <- predict(fitB, newdata = validB, type = "response")
validB$pred <- ifelse(validB$pred > 0.5,1,0)
table(validB$marital2, validB$pred)
# accuracy overall (254 + 60)/(254+31+77+60) = .74
testB$pred <- predict(fitB, newdata = testB, type = "response")
testB$pred <- ifelse(testB$pred > 0.5,1,0)
table(testB$marital2, testB$pred)
# (264+51)/(264+39+73+51) = 0.74
# Test B with only stat sig variables
fitB2 <- glm(marital2 ~ pornlaw + childs + degree + Agecat1 + wrkstat + happy + class + income + reg16 + attend + bible + satjob + xmovie + divlaw, family=binomial(link='logit'),data=trainB)
summary(fitB2)
trainB$pred2 <- fitB2$fitted.values
trainB$pred2 <- ifelse(trainB$pred2 > 0.5,1,0)
table(trainB$marital2, trainB$pred2)
# accuracy overall (1304+234)/(1304+103+347+234) = .77
# not any better predictor for split, only .60
validB$pred2 <- predict(fitB2, newdata = validB, type = "response")
validB$pred2 <- ifelse(validB$pred2 > 0.5,1,0)
table(validB$marital2, validB$pred2)
# accuracy overall (255+56)/(255+30+81+56) = .74
testB$pred2 <- predict(fitB2, newdata = testB, type = "response")
testB$pred2 <- ifelse(testB$pred2 > 0.5,1,0)
table(testB$marital2, testB$pred2)
# (271+47)/(271+32+77+47) = 0.74
## Ballot C
fitC <- glm(marital2 ~ year + region + xmarsex + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + trust + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + helpful + fair + consci + satjob + satfin + xmovie, family=binomial(link='logit'),data=trainC)
summary(fitC)
trainC$pred <- fitC$fitted.values
trainC$pred <- ifelse(trainC$pred > 0.5,1,0)
table(trainC$marital2, trainC$pred)
# accuracy overall (1292+109)/(1292+109+295+288) = .71
validC$pred <- predict(fitC, newdata = validC, type = "response")
validC$pred <- ifelse(validC$pred > 0.5,1,0)
table(validC$marital2, validC$pred)
# accuracy overall (259+46)/(259+42+80+46) = .71
testC$pred <- predict(fitC, newdata = testC, type = "response")
testC$pred <- ifelse(testC$pred > 0.5,1,0)
table(testC$marital2, testC$pred)
fitC2 <- glm(marital2 ~ xmarsex + pornlaw + partyid + fund + polviews + happy + trust + attend + relig16 + helpful + fair + consci + satjob + satfin + xmovie, family=binomial(link='logit'),data=trainC)
summary(fitC2)
trainC$pred2 <- fitC2$fitted.values
trainC$pred2 <- ifelse(trainC$pred2 > 0.5,1,0)
table(trainC$marital2, trainC$pred2)
# accuracy overall (1296+183)/(1296+105+400+183) = .75
validC$pred2 <- predict(fitC2, newdata = validC, type = "response")
validC$pred2 <- ifelse(validC$pred2 > 0.5,1,0)
table(validC$marital2, validC$pred2)
# accuracy overall (262+33)/(262+39+93+33) = .69
testC$pred2 <- predict(fitC2, newdata = testC, type = "response")
testC$pred2 <- ifelse(testC$pred2 > 0.5,1,0)
table(testC$marital2, testC$pred2)
# accuracy (288+28)/(288+35+94+28) = 0.71
```
## Analysis Method 2: Decision Trees
For both decision trees and random forests (below), we tested models trained on complete cases and models trained on data with imputed values for missing observations. In both the complete-case and imputed settings, the decision tree models for all three ballots were more accurate than their random forest counterparts. For these initial runs, we tested our models on their corresponding training sets.
### Model: Decision Trees with Complete Cases: Initial Tests on Complete Cases Training Sets
We trained decision tree models on the complete observations for Ballots A, B, and C, then tested them on the corresponding test sets and compared the accuracy rates to those of the imputed models. As an example, the model based on the complete observations for Ballot A had an overall accuracy rate of 75.38%, with a 78.31% accuracy rate in predicting married statuses versus a 59.76% accuracy rate in predicting split statuses.
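The `cp` values passed to `rpart()` in the chunks below (e.g. 0.00479512) come from the model's complexity-parameter table. As a sketch of the standard workflow, shown here on rpart's built-in `kyphosis` data rather than the GSS data, one grows a tree and prunes at the `cp` with the lowest cross-validated error:

```r
library(rpart)

# Grow a classification tree on the built-in kyphosis dataset (illustration only)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")
printcp(fit)  # cp table: columns CP, nsplit, rel error, xerror, xstd

# Prune at the cp value minimizing the cross-validated error (xerror)
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```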
``` {r decision trees complete, include=FALSE, cache=TRUE}
#Ballot A
rtree_fitA <-rpart(marital2 ~ year + sexeduc + region + premarsx + xmarsex + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + satfin + abnomore + absingle + divlaw + fefam, data = trainA, method = 'class', cp = 0.00479512)
printcp(rtree_fitA)
rpart.plot(rtree_fitA, shadow.col="gray", nn=TRUE)
trainApred <- predict(rtree_fitA, trainA, type = 'class')
table(trainApred, trainA$marital2)
##Accuracy: (1570+319)/(1570+101+319+357)
###Accuracy: 80.49%
#Ballot B
rtree_fitB <-rpart(marital2 ~ year + sexeduc + region + premarsx + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + trust + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + helpful + fair + consci + satjob + satfin + divlaw + xmovie + fefam, data = trainB, method = 'class', cp = 0.00519031)
printcp(rtree_fitB)
rpart.plot(rtree_fitB, shadow.col="gray", nn=TRUE)
trainBpred <- predict(rtree_fitB, trainB, type = 'class')
table(trainBpred, trainB$marital2)
##Accuracy: (1356+164)/(1356+164+417+51)
###Accuracy: 76.46%
#Ballot C
rtree_fitC <-rpart(marital2 ~ year + region + xmarsex + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + trust + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + helpful + fair + consci + satjob + satfin + abnomore + absingle + xmovie, data = trainC, method = 'class', cp = 0.00604491)
printcp(rtree_fitC)
rpart.plot(rtree_fitC, shadow.col="gray", nn=TRUE)
trainCpred <- predict(rtree_fitC, trainC, type = 'class')
table(trainCpred, trainC$marital2)
##Accuracy: (1305+287)/(1305+287+296+96)
##Accuracy: 80.24%
##Source: https://www.r-bloggers.com/using-decision-trees-to-predict-infant-birth-weights/
####Decision trees: more error in predicting married (2/3), more accurate with predicting split (4/5)
```
```{r knn imputation, include=FALSE, cache=TRUE}
#using DMwR package
## Ballot A
knnOutputA <- knnImputation(GSS.A[, !names(GSS.A) %in% "medv"]) # kNN imputation; the "medv" exclusion is a leftover from a housing-data example and is a no-op here (GSS.A has no such column)
anyNA(knnOutputA)
summary(knnOutputA)
## Ballot B
knnOutputB <- knnImputation(GSS.B[, !names(GSS.B) %in% "medv"]) # perform knn imputation.
anyNA(knnOutputB)
summary(knnOutputB)
## Ballot C
knnOutputC <- knnImputation(GSS.C[, !names(GSS.C) %in% "medv"]) # perform knn imputation.
anyNA(knnOutputC)
summary(knnOutputC)
##Separate into train, validate and test
## Ballot A
set.seed(567)
rand <- runif(nrow(knnOutputA))
trainAknn <- knnOutputA[rand>0.3,]
validAknn <- knnOutputA[rand>0.15 & rand <0.3,]
testAknn <- knnOutputA[rand<0.15,]
## Ballot B
set.seed(678)
rand <- runif(nrow(knnOutputB))
trainBknn <- knnOutputB[rand>0.3,]
validBknn <- knnOutputB[rand>0.15 & rand <0.3,]
testBknn <- knnOutputB[rand<0.15,]
## Ballot C
set.seed(789)
rand <- runif(nrow(knnOutputC))
trainCknn <- knnOutputC[rand>0.3,]
validCknn <- knnOutputC[rand>0.15 & rand <0.3,]
testCknn <- knnOutputC[rand<0.15,]
```
### Model: Decision Trees with Imputed Values: Initial Tests on Imputed Data Training Sets
We trained decision tree models on imputed data for Ballots A, B, and C and then tested them. As with the random forest models, the decision tree model based on Ballot C's imputed observations was most accurate, with an overall accuracy rate of 78.73%. The tree below shows that the splits were based on income, happiness, degree, age, region at age 16, work status, financial satisfaction, family life at age 16, church attendance, and political views.
``` {r imputed decision trees, include=FALSE, cache=TRUE}
###Decision Trees
## Ballot A
rtree_fitAknn <-rpart(marital2 ~ year + sexeduc + region + premarsx + xmarsex + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + satfin + abnomore + absingle + divlaw + fefam, data = trainAknn, method = 'class', cp = 0.00435920)
printcp(rtree_fitAknn)
rpart.plot(rtree_fitAknn, shadow.col="gray", nn=TRUE)
trainApredknn <- predict(rtree_fitAknn, trainAknn, type = 'class')
table(trainApredknn, trainAknn$marital2)
##Accuracy: (2592+379)/(2592+379+768+170) = 76.00%
#Ballot B
rtree_fitBknn <-rpart(marital2 ~ year + sexeduc + region + premarsx + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + trust + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + helpful + fair + consci + satfin + divlaw + xmovie + fefam, data = trainBknn, method = 'class', cp = 0.00519031)
printcp(rtree_fitBknn)
rpart.plot(rtree_fitBknn, shadow.col="gray", nn=TRUE)
trainBpredknn <- predict(rtree_fitBknn, trainBknn, type = 'class')
table(trainBpredknn, trainBknn$marital2)
##Accuracy: (2699+311)/(2699+311+845+122) = 75.69%
#Ballot C
rtree_fitCknn <-rpart(marital2 ~ year + region + xmarsex + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + trust + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + helpful + fair + consci + satjob + satfin + abnomore + absingle + xmovie, data = trainCknn, method = 'class', cp = 0.00388601)
printcp(rtree_fitCknn)
rpart.plot(rtree_fitCknn, shadow.col="gray", nn=TRUE)
trainCpredknn <- predict(rtree_fitCknn, trainCknn, type = 'class')
table(trainCpredknn, trainCknn$marital2)
##Accuracy: (2702+448)/(2702+448+710+141) = 78.73%
```
```{r Imputed Ballot C, echo=FALSE, warning=FALSE, message=FALSE}
rpart.plot(rtree_fitCknn, shadow.col="gray", nn=TRUE)
```
### Decision Trees: Comparing Predictive Accuracy of Imputed and Complete Case Decision Tree Models on Complete Case Test Sets
The most accurate model for predicting split (divorced/separated) outcomes is the imputed decision tree model based on Ballot C. The model's overall accuracy rate in predicting marital status is 75.51%, with a 78.16% accuracy rate in predicting married statuses and a 60% accuracy rate in predicting splits.
Imputed Ballot C | predicted married | predicted split | accuracy
-----------------|---------------------|-----------------|----------
married |297 |83 | 0.78
split |26 |39 | 0.60
```{r test models on complete data, include=FALSE, cache=TRUE, message=FALSE, warning=FALSE}
#Ballot A
###Complete on Complete
rtree_fitAcc <-rpart(marital2 ~ year + sexeduc + region + premarsx + xmarsex + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + satfin + abnomore + absingle + divlaw + fefam, data = trainA, method = 'class', cp = 0.00739645)
printcp(rtree_fitAcc)
rpart.plot(rtree_fitAcc, shadow.col="gray", nn=TRUE)
trainApredcc <- predict(rtree_fitAcc, testA, type = 'class')
table(trainApredcc, testA$marital2)
##Overall Accuracy: (343+49)/(343+49+95+33) = 75.38%
##Married Accuracy: 343/(343+95) = 78.31%
##Split Accuracy: 49/82 = 59.76%
##Imputed on Complete
rtree_fitAknn <-rpart(marital2 ~ year + sexeduc + region + premarsx + xmarsex + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + satfin + abnomore + absingle + divlaw + fefam, data = trainAknn, method = 'class', cp = 0.00435920)
printcp(rtree_fitAknn)
rpart.plot(rtree_fitAknn, shadow.col="gray", nn=TRUE)
trainApredknnonc <- predict(rtree_fitAknn, testA, type = 'class')
table(trainApredknnonc, testA$marital2)
##Overall Accuracy: (349+40)/(349+104+27+40)=74.81%
##Married Accuracy: 349/(349+104)=77.04%
##Split Accuracy: 40/67 = 59.70%
#Ballot B
#Complete on Complete
rtree_fitBcc <-rpart(marital2 ~ year + sexeduc + region + premarsx + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + trust + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + helpful + fair + consci + satfin + divlaw + xmovie + fefam, data = trainB, method = 'class', cp = 0.00681431)
printcp(rtree_fitBcc)
rpart.plot(rtree_fitBcc, shadow.col="gray", nn=TRUE)
trainBpredcc <- predict(rtree_fitBcc, testB, type = 'class')
table(trainBpredcc, testB$marital2)
##Overall Accuracy: (274+30)/(274+30+29+94)=71.19%
##Married accuracy: (274)/(274+94) = 74.46%
##Split accuracy: 30/59 = 50.85%
##Imputed on Complete
rtree_fitBknn <-rpart(marital2 ~ year + sexeduc + region + premarsx + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + trust + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + helpful + fair + consci + satfin + divlaw + xmovie + fefam, data = trainBknn, method = 'class', cp = 0.00201845)
printcp(rtree_fitBknn)
rpart.plot(rtree_fitBknn, shadow.col="gray", nn=TRUE)
trainBpredknnonc <- predict(rtree_fitBknn, testB, type = 'class')
table(trainBpredknnonc, testB$marital2)
##Accuracy: (272+40)/(272+84+31+40) = 73.07%
##Married accuracy: 272/(272+84) = 76.40%
##Split accuracy: 40/71 = 56.34%
#Ballot C
#Complete on Complete
rtree_fitCcc <-rpart(marital2 ~ year + region + xmarsex + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + trust + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + helpful + fair + consci + satjob + satfin + abnomore + absingle + xmovie, data = trainC, method = 'class', cp = 0.00681431)
printcp(rtree_fitCcc)
rpart.plot(rtree_fitCcc, shadow.col="gray", nn=TRUE)
trainCpredcc <- predict(rtree_fitCcc, testC, type = 'class')
table(trainCpredcc, testC$marital2)
##Accuracy: (279+45)/(279+77+44+45)= 72.81%
##Married accuracy:279/(279+77)= 78.38%
##Split accuracy: 45/89= 50.56%
#Imputed on Complete
rtree_fitCknn <-rpart(marital2 ~ year + region + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + trust + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + helpful + fair + consci + satjob + satfin + abnomore + absingle, data = trainCknn, method = 'class', cp = 0.00435920)
printcp(rtree_fitCknn)
ImpTreeC <- rpart.plot(rtree_fitCknn, shadow.col="gray", nn=TRUE)
trainCpredknn <- predict(rtree_fitCknn, testC, type = 'class')
table(trainCpredknn, testC$marital2)
##Accuracy: (297+39)/(297+83+26+39)=75.51%
##Married accuracy:297/(297+83)=78.16%
##Split accuracy: 39/(26+39)=60%
```
## Analysis Method 3: Random Forests
### Random Forests with Complete Cases
We trained a set of random forest models on data from complete observations from Ballots A, B and C, and we then tested these models on their corresponding validation sets. The model based on the complete observations for Ballot C had an overall error rate of 23.79%, with an error in predicting married statuses of 5.64% vs. an error in predicting split statuses of 67.41%.
```{r rf complete cases, include=FALSE, cache=TRUE}
######Train-Test
## Train-validate-test split for GSS Ballot A
forestAcomplete <- randomForest(marital2 ~ year + sexeduc + region + premarsx + xmarsex + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + satfin + abnomore + absingle + divlaw + fefam, data = trainA)
summary(trainA)
##Measuring error rate:
forestAcomplete
## OOB estimate of error rate: 24.51%
#Variable importance
print(importance(forestAcomplete, type = 2))
#Most important variables: year, sex ed, region, premarsx, xmarsex, partyid, fundamentalism, children#,
#degree, age ...etc.
plot(forestAcomplete)
#Search for the optimal number of input features (mtry)
forestAcomplete.tune <- tuneRF(trainA[,-1], trainA$marital2, ntreeTry = 500,
mtryStart = 1, stepFactor = 2,
improve = 0.001, trace = TRUE, plot = TRUE)
##lowest OOB error: mtry = 4 with an OOB error of 24.99% --> thus, 4 variables split = optimal
forestAcomplete.tune <-randomForest(marital2 ~ year + sexeduc + region + premarsx + xmarsex + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + satfin + abnomore + absingle + divlaw + fefam,data=trainA, mtry=4, ntree=500,
keep.forest=TRUE, importance=TRUE) # randomForest() has no 'test' argument; validation-set predictions are made below with predict()
forestAcomplete.tune
#OOB estimate of error rate is 25.21%; married = 6.39% vs. split: 68.34%
#Predict values for train set
pred.rf.trainAcomplete <- predict(forestAcomplete.tune, newdata = trainA, type='prob')[,2]
#Predict values for validation set
pred.rf.validAcomplete <- predict(forestAcomplete, validA, type='prob')[,2]
#Set up ROC inputs
input.rfAcomplete <- rbind(data.frame(model = "trainAcomplete", d = trainA$marital2, m = pred.rf.trainAcomplete),
data.frame(model = "validAcomplete", d = validA$marital2, m = pred.rf.validAcomplete))
#Graph train and validation ROCs
roc.rfAcomplete <- ggplot(input.rfAcomplete, aes(d = d, model = model, m = m, colour = model)) +
geom_roc(show.legend = TRUE) + style_roc() +ggtitle("TrainAcomplete")
#AUC
calc_auc(roc.rfAcomplete)
importance(forestAcomplete)
varImpPlot(forestAcomplete)
forestBcomplete <- randomForest(marital2 ~ year + sexeduc + region + premarsx + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + trust + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + helpful + fair + consci + satfin + divlaw + xmovie + fefam, data = trainB)
##Measuring error rate:
forestBcomplete
## OOB estimate of error rate: 26.85%
##6.23% error rate with guessing married, 72.6% error rate with guessing split
#Variable importance
print(importance(forestBcomplete, type = 2))
#Variable importance rankings for this model are in the importance() output above
plot(forestBcomplete)
#Search for the optimal number of input features (mtry)
forestBcomplete.tune <- tuneRF(trainB[,-1], trainB$marital2, ntreeTry = 500,
mtryStart = 1, stepFactor = 2,
improve = 0.001, trace = TRUE, plot = TRUE)
##lowest OOB error: mtry = 16 with an OOB error of 25.74% --> thus, 16 variables split = optimal
forestBcomplete.tune <-randomForest(marital2 ~ year + sexeduc + region + premarsx + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + trust + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + helpful + fair + consci + satfin + divlaw + xmovie + fefam,data=trainB, mtry=16, ntree=500,
keep.forest=TRUE, importance=TRUE) # randomForest() has no 'test' argument; validation-set (not test-set) predictions are made below with predict()
forestBcomplete.tune
##OOB estimate of error rate is 27.01%; married = 8.92% vs. split: 67.12%
#Predict values for train set
pred.rf.trainBcomplete <- predict(forestBcomplete.tune, newdata = trainB, type='prob')[,2]
#Predict values for validating set
pred.rf.validBcomplete <- predict(forestBcomplete, validB, type='prob')[,2]
#Set up ROC inputs
input.rfBcomplete <- rbind(data.frame(model = "trainB", d = trainB$marital2, m = pred.rf.trainBcomplete),
data.frame(model = "validB", d = validB$marital2, m = pred.rf.validBcomplete))
#Graph train and validation ROCs
roc.rfBcomplete <- ggplot(input.rfBcomplete, aes(d = d, model = model, m = m, colour = model)) +
geom_roc(show.legend = TRUE) + style_roc() +ggtitle("TrainB")
#AUC
calc_auc(roc.rfBcomplete)
importance(forestBcomplete)
varImpPlot(forestBcomplete)
forestCcomplete <- randomForest(marital2 ~ year + region + xmarsex + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + trust + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + helpful + fair + consci + satjob + satfin + abnomore + absingle +xmovie, data = trainC)
##Measuring error rate:
forestCcomplete
## OOB estimate of error rate: 30.04%
##9.82% error rate with guessing married, 67.78% error rate with guessing split
#Variable importance
print(importance(forestCcomplete, type = 2))
#Variable importance rankings for this model are in the importance() output above
plot(forestCcomplete)
#Search for the optimal number of input features (tuneRF began throwing an error, but previous runs and partial results give mtry = 4)
#forestCcomplete.tune <- tuneRF(trainC[,-1], trainC$marital2, ntreeTry = 500,
#mtryStart = 1, stepFactor = 2,
#improve = 0.001, trace = TRUE, plot = TRUE)
##lowest OOB error: mtry = 4 with an OOB error of 28.1% --> thus, 4 variables split = optimal
forestCcomplete.tune <-randomForest(marital2 ~ year + region + xmarsex + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + trust + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + helpful + fair + consci + satjob + satfin + abnomore + absingle + xmovie,data=trainC, mtry=4, ntree=500,
keep.forest=TRUE, importance=TRUE) # randomForest() has no 'test' argument; validation-set predictions are made below with predict()
forestCcomplete.tune
##OOB estimate of error rate is 29.84%; married = 8.63% vs. split: 69.44%
#Predict values for train set
pred.rf.trainCcomplete <- predict(forestCcomplete.tune, newdata = trainC, type='prob')[,2]
#Predict values for validation set
pred.rf.validCcomplete <- predict(forestCcomplete, validC, type='prob')[,2]
#Set up ROC inputs
input.rfCcomplete <- rbind(data.frame(model = "trainCcomplete", d = trainC$marital2, m = pred.rf.trainCcomplete),
data.frame(model = "validC", d = validC$marital2, m = pred.rf.validCcomplete))
#Graph train and validation ROCs
roc.rfCcomplete <- ggplot(input.rfCcomplete, aes(d = d, model = model, m = m, colour = model)) +
geom_roc(show.legend = TRUE) + style_roc() +ggtitle("TrainCcomplete")
#AUC
calc_auc(roc.rfCcomplete)
importance(forestCcomplete)
varImpPlot(forestCcomplete)
```
### Random Forests with Imputed Values
Just as with the complete-case models, we trained random forest models on imputed data for Ballots A, B and C and tested them on the corresponding validation sets. Again, the model based on imputed Ballot C data was most accurate: it had an overall error rate of 24.6%, with a 5.5% error rate for married respondents and a 70.5% error rate for split respondents.
```{r rf imputed, include=FALSE, cache=TRUE}
####Trying Random Forest with condensed variables
forestA <- randomForest(marital2 ~ year + sexeduc + region + premarsx + xmarsex + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + satfin + abnomore + absingle + divlaw + fefam, data = trainAknn)
summary(trainAknn)
##Measuring error rate:
forestA
## OOB estimate of error rate: 24.51%
#Variable importance
print(importance(forestA, type = 2))
#Most important variables: year, sex ed, region, premarsx, xmarsex, partyid, fundamentalism, children#,
#degree, age ...etc.
plot(forestA)
#Search for the optimal number of input features (mtry)
forestA.tune <- tuneRF(trainAknn[,-1], trainAknn$marital2, ntreeTry = 500,
mtryStart = 1, stepFactor = 2,
improve = 0.001, trace = TRUE, plot = TRUE)
##lowest OOB error: mtry = 8 with an OOB error of 23.94% --> thus, 8 variables split = optimal
forestA.tune <-randomForest(marital2 ~ year + sexeduc + region + premarsx + xmarsex + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + satfin + abnomore + absingle + divlaw + fefam,data=trainAknn, mtry=8, ntree=500,
keep.forest=TRUE, importance=TRUE) # randomForest() has no 'test' argument; test-set predictions are made below with predict()
forestA.tune
##OOB estimate of error rate is 24.15%; married = 6.15% vs. split: 65.65%
#Predict values for train set
pred.rf.trainA <- predict(forestA.tune, newdata = trainAknn, type='prob')[,2]
#Predict values for test set
pred.rf.testA <- predict(forestA, testAknn, type='prob')[,2]
#Set up ROC inputs
input.rfA <- rbind(data.frame(model = "trainA", d = trainAknn$marital2, m = pred.rf.trainA),
data.frame(model = "testA", d = testAknn$marital2, m = pred.rf.testA))
#Graph train and test ROCs
roc.rfA <- ggplot(input.rfA, aes(d = d, model = model, m = m, colour = model)) +
geom_roc(show.legend = TRUE) + style_roc() +ggtitle("TrainA")
#AUC
calc_auc(roc.rfA)
importance(forestA)
varImpPlot(forestA)
## Ballot B
forestB <- randomForest(marital2 ~ year + sexeduc + region + premarsx + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + trust + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + helpful + fair + consci + satfin + divlaw + xmovie + fefam, data = trainBknn)
summary(trainBknn)
##Measuring error rate:
forestB
## OOB estimate of error rate: 25.57%
##6.13% error rate with guessing married, 73% error rate with guessing split
#Variable importance
print(importance(forestB, type = 2))
#Variable importance rankings for this model are in the importance() output above
plot(forestB)
#Search for the optimal number of input features (mtry)
#forestB.tune <- tuneRF(trainBknn[,-1], trainBknn$marital2, ntreeTry = 500,
#mtryStart = 1, stepFactor = 2,
#improve = 0.001, trace = TRUE, plot = TRUE)
##lowest OOB error: mtry = 16 with an OOB error of 18.88% --> thus, 16 variables split = optimal
forestB.tune <-randomForest(marital2 ~ year + sexeduc + region + premarsx + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + trust + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + helpful + fair + consci + satfin + divlaw + xmovie + fefam,data=trainBknn, mtry=16, ntree=500,
keep.forest=TRUE, importance=TRUE) # randomForest() has no 'test' argument; test-set predictions are made below with predict()
forestB.tune
##OOB estimate of error rate is 25.19%; married = 7.37% vs. split: 68.69%
#plotROC
#Predict values for train set
pred.rf.trainB <- predict(forestB.tune, newdata = trainBknn, type='prob')[,2]
#Predict values for test set
pred.rf.testB <- predict(forestB, testBknn, type='prob')[,2]
#Set up ROC inputs
input.rfB <- rbind(data.frame(model = "trainB", d = trainBknn$marital2, m = pred.rf.trainB),
data.frame(model = "testB", d = testBknn$marital2, m = pred.rf.testB))
#Graph train and test ROCs
roc.rfB <- ggplot(input.rfB, aes(d = d, model = model, m = m, colour = model)) +
geom_roc(show.legend = TRUE) + style_roc() +ggtitle("TrainB")
#AUC
calc_auc(roc.rfB)
importance(forestB)
varImpPlot(forestB)
## Ballot C
forestC <- randomForest(marital2 ~ year + region + xmarsex + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + trust + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + helpful + fair + consci + satjob + satfin + abnomore + absingle + xmovie, data = trainCknn)
summary(trainCknn)
##Measuring error rate:
forestC
## OOB estimate of error rate: 24.09%
##5.42% error rate with guessing married, 69.95% error rate with guessing split
#Variable importance
print(importance(forestC, type = 2))
#Variable importance rankings for this model are in the importance() output above
plot(forestC)
#Search for the optimal number of input features (mtry)
#forestC.tune <- tuneRF(trainCknn[,-1], trainCknn$marital2, ntreeTry = 500,
#mtryStart = 1, stepFactor = 2,
#improve = 0.001, trace = TRUE, plot = TRUE)
##lowest OOB error: mtry = 8 with an OOB error of 14.3% --> thus, 8 variables split = optimal
forestC.tune <-randomForest(marital2 ~ year + region + xmarsex + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + trust + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + helpful + fair + consci + satjob + satfin + abnomore + absingle + xmovie,data=trainCknn, mtry=8, ntree=500,
keep.forest=TRUE, importance=TRUE) # randomForest() has no 'test' argument; test-set predictions are made below with predict()
forestC.tune
##OOB estimate of error rate is 23.27%; married = 6.30% vs. split: 64.94%
#Predict values for train set
pred.rf.trainC <- predict(forestC.tune, newdata = trainCknn, type='prob')[,2]
#Predict values for test set
pred.rf.testC <- predict(forestC, testCknn, type='prob')[,2]
#Set up ROC inputs
input.rfC <- rbind(data.frame(model = "trainC", d = trainCknn$marital2, m = pred.rf.trainC),
data.frame(model = "testC", d = testCknn$marital2, m = pred.rf.testC))
#Graph train and test ROCs
roc.rfC <- ggplot(input.rfC, aes(d = d, model = model, m = m, colour = model)) +
geom_roc(show.legend = TRUE) + style_roc() +ggtitle("TrainC")
#AUC
calc_auc(roc.rfC)
importance(forestC)
varImpPlot(forestC)
#Source: http://dataaspirant.com/2017/01/02/k-nearest-neighbor-classifier-implementation-r-scratch/
```
### Random Forests: Comparing Results
Finally, we compared our random forest models by testing them on the test (rather than validation) sets corresponding to the complete observations for Ballots A, B and C. Here, the random forest model trained on imputed data for Ballot C was most accurate, with an overall error rate of 23.87% (roughly 76% overall accuracy). It predicted married respondents with 95.5% accuracy (a 4.5% error rate) but split respondents with only 29.5% accuracy (a 70.5% error rate). From these results, this model could be a useful tool for predicting whether a respondent is married (and perhaps will be or stay married), but it has little power with respect to divorce.
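The gap between strong overall accuracy and weak split-class accuracy reflects class imbalance: overall accuracy is the prevalence-weighted average of the class-wise accuracies. As an added back-of-envelope check (ours, not part of the original analysis), solving that relationship using the error rates quoted above recovers the implied share of married respondents in the test set:

```{r implied class balance}
# Sketch using the rates quoted above (4.5% married error, 70.5% split error,
# 23.87% overall error). Solving
#   overall = p * err_married + (1 - p) * err_split
# for p gives the implied proportion of married respondents.
err_married <- 0.045
err_split   <- 0.705
err_overall <- 0.2387
p_married <- (err_split - err_overall) / (err_split - err_married)
p_married  # about 0.71, i.e. roughly 71% of test respondents are married
```

With roughly seven in ten respondents married, a model that is excellent on the married class can post a low overall error rate even while missing most split respondents.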
```{r rf complete model on test data, include=FALSE, cache=TRUE}
##Complete on Complete (use 'test' rather than 'validate')
## Train-validate-test split for GSS Ballot A
forestAcomplete.tune <-randomForest(marital2 ~ year + sexeduc + region + premarsx + xmarsex + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + satfin + abnomore + absingle + divlaw + fefam,data=trainA, mtry=4, ntree=500,
keep.forest=TRUE, importance=TRUE) # randomForest() silently ignores the unknown 'test' argument; the rates below are OOB estimates
forestAcomplete.tune
##OOB estimate of error rate is 25.17%; married = 6.52% vs. split: 67.9%
##Complete on Complete
## test for GSS Ballot B
forestBcomplete.tune <-randomForest(marital2 ~ year + sexeduc + region + premarsx + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + trust + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + helpful + fair + consci + satfin + divlaw + xmovie + fefam,data=trainB, mtry=16, ntree=500,
keep.forest=TRUE, importance=TRUE) # randomForest() silently ignores the unknown 'test' argument; the rates below are OOB estimates
forestBcomplete.tune
##OOB estimate of error rate is 26.17%; married = 8.2% vs. split: 65.93%
##Ballot C
forestCcomplete.tune <-randomForest(marital2 ~ year + region + xmarsex + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + trust + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + helpful + fair + consci + satjob + satfin + abnomore + absingle + xmovie,data=trainC, mtry=4, ntree=500,
keep.forest=TRUE, importance=TRUE) # randomForest() silently ignores the unknown 'test' argument; the rates below are OOB estimates
forestCcomplete.tune
##OOB estimate of error rate is 29.84%; married = 8.63% vs. split: 69.44%
```
```{r rf imputed model on completedata, include=FALSE, cache=TRUE}
##Imputed on Complete
###BallotA
forestAimponcomp.tune <-randomForest(marital2 ~ year + sexeduc + region + premarsx + xmarsex + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + satfin + abnomore + absingle + divlaw + fefam,data=trainAknn, mtry=8, ntree=500,
keep.forest=TRUE, importance=TRUE) # randomForest() silently ignores the unknown 'test' argument; the rates below are OOB estimates
forestAimponcomp.tune
###Trained on imputed dataset, tested on complete dataset
##OOB estimate of error rate is 23.97%; married = 7.2% vs. split: 64.3%
##Imputed (train on imputed trainAknn and tested on testA) is more accurate for Ballot A
###BallotB
forestBimponcomp.tune <-randomForest(marital2 ~ year + sexeduc + region + premarsx + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + trust + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + helpful + fair + consci + satfin + divlaw + xmovie + fefam,data=trainBknn, mtry=16, ntree=500,
keep.forest=TRUE, importance=TRUE) # randomForest() silently ignores the unknown 'test' argument; the rates below are OOB estimates
forestBimponcomp.tune
##OOB estimate of error rate is 25.4%; married = 7.62% vs. split: 68.77%
##Imputed is more accurate overall for Ballot B, but for the split category specifically, forestBcomplete.tune (trained on complete cases) is more accurate
###BallotC
forestCimponcomp.tune <-randomForest(marital2 ~ year + region + xmarsex + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + trust + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + helpful + fair + consci + satjob + satfin + abnomore + absingle + xmovie,data=trainCknn, mtry=4, ntree=500,
keep.forest=TRUE, importance=TRUE) # randomForest() silently ignores the unknown 'test' argument; the rates below are OOB estimates
forestCimponcomp.tune
##OOB estimate of error rate: 23.92%: married = 4.78%; split: 70.9%
##Imputed is more accurate overall for Ballot C, but for the split category specifically, training on complete cases (forestCcomplete.tune) is more accurate
```
## Analysis Method Four: Clusters
We also tried analyzing the data using clustering methods. Though the initial tests to determine the optimal number of clusters (2) seemed promising, the clusters did not accurately separate married from split respondents. Clustering was ineffective regardless of the algorithm and distance or linkage method used -- kmeans, PAM (partitioning around medoids), hierarchical clustering with Ward, single, complete, or maximum linkage, Manhattan distance, etc.
```{r cluster setup, include=FALSE, cache = TRUE }
#using packages factoextra and NbClust and Cluster
# create binary matrix
gss.a.mat <- model.matrix(~0 + year + sexeduc + region + premarsx + xmarsex + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + satfin + abnomore + absingle + divlaw, GSS.A.comp)
# now with B ballot
gss.b.mat <- model.matrix(~0 + year + sexeduc + region + premarsx + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + trust + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + bible + helpful + fair + consci + satjob + satfin + xmovie + divlaw + fefam, GSS.B.comp)
# now with Ballot C
gss.c.mat <- model.matrix(~0 + year + region + xmarsex + pornlaw + partyid + fund + childs + degree + Agecat1 + widowed + wrkstat + polviews + happy + trust + class + income + reg16 + family16 + born + parborn + sizecat + attend + relig16 + helpful + fair + consci + satjob + satfin + abnomore + absingle + xmovie, GSS.C.comp)
```
``` {r cluster A, include=FALSE, cache=TRUE}
#determine optimal number of clusters using method within-cluster sum of squares
# visualize as line ("elbow" method)
elbow.a <- fviz_nbclust(gss.a.mat, kmeans, method = "wss")
# (silhouette method)
sil.plot.a <- fviz_nbclust(gss.a.mat, kmeans, method = "silhouette")
# optimal number of clusters is 2
# compute kmeans clusters
km.a <- kmeans(gss.a.mat, 2, 10)
# compute hierarchical clusters and check silhouette for validation
hc.a <- eclust(gss.a.mat, "hclust", k=2, method="ward.D2", graph=FALSE)
sil.a <- hc.a$silinfo
sil.a$clus.avg.widths
# not very good 0.44/0.42
# compute PAM clusters
pam.a <- pam(gss.a.mat, 2)
# visualize as silhouette plot
fviz_silhouette(pam.a)
# similar to hierarchical
#compare clusters to marital status
table(GSS.A.comp$marital2, km.a$cluster)
table(GSS.A.comp$marital2, hc.a$cluster)
table(GSS.A.comp$marital2, pam.a$cluster)
```
For example, using the data from Ballot A, an elbow plot and a silhouette plot both indicate that the number of clusters should be 2.
```{r cluster output A, echo=FALSE, warning=FALSE, message=FALSE}
grid.arrange(elbow.a, sil.plot.a, ncol=2)
```
However, none of the kmeans, PAM, or hierarchical clustering methods correctly categorizes respondents as married or split. The following table shows the "accuracy" calculated as the number of married observations assigned to cluster 1 plus the number of split observations assigned to cluster 2, divided by the total number of observations.
```{r cluster B and C, include=FALSE, cache=TRUE}
#determine optimal number of clusters using method within-cluster sum of squares
# visualize as line ("elbow" method)
elbow.b <- fviz_nbclust(gss.b.mat, kmeans, method = "wss")
# (silhouette method)
sil.plot.b <- fviz_nbclust(gss.b.mat, kmeans, method = "silhouette")
# optimal number of clusters is 2
# compute kmeans clusters
km.b <- kmeans(gss.b.mat, 2, 10)
# compute hierarchical clusters and check silhouette for validation
hc.b <- eclust(gss.b.mat, "hclust", k=2, method="complete", graph=FALSE)
sil.b <- hc.b$silinfo
sil.b$clus.avg.widths
# not very good ~0.47/~0.38
# compute PAM clusters
pam.b <- pam(gss.b.mat, 2)
# visualize as silhouette plot
fviz_silhouette(pam.b)
# similar to hierarchical
#compare clusters to marital status
table(GSS.B.comp$marital2, km.b$cluster)
table(GSS.B.comp$marital2, hc.b$cluster)
table(GSS.B.comp$marital2, pam.b$cluster)
#determine optimal number of clusters using method within-cluster sum of squares
# visualize as line ("elbow" method)
fviz_nbclust(gss.c.mat, kmeans, method = "wss")
# (silhouette method)
fviz_nbclust(gss.c.mat, kmeans, method = "silhouette")
# optimal number of clusters is 2
# compute kmeans clusters
km.c <- kmeans(gss.c.mat, 2, 10)
# compute hierarchical clusters and check silhouette for validation
hc.c <- eclust(gss.c.mat, "hclust", k=2, method="complete", graph=FALSE)
sil.c <- hc.c$silinfo
sil.c$clus.avg.widths
# not very good ~0.28/~0.41
# compute PAM clusters
pam.c <- pam(gss.c.mat, 2)
# visualize as silhouette plot
fviz_silhouette(pam.c)
# similar to hierarchical
#compare clusters to marital status
table(GSS.C.comp$marital2, km.c$cluster)
table(GSS.C.comp$marital2, hc.c$cluster)
table(GSS.C.comp$marital2, pam.c$cluster)
```
Method | Overall accuracy
-------------|------------------
kmeans | 0.55
hierarchical | 0.48
PAM | 0.48
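Because cluster labels are arbitrary, a fairer score takes the better of the two possible label-to-class assignments. A small helper (our illustrative addition, not in the original chunks) that applies this idea to a `table(class, cluster)` cross-tabulation:

```{r cluster accuracy helper}
# Helper (illustrative): score a 2-cluster solution against a binary class
# label under the better of the two possible cluster-label assignments.
cluster_accuracy <- function(tab) {
  max(tab[1, 1] + tab[2, 2], tab[1, 2] + tab[2, 1]) / sum(tab)
}
# e.g. cluster_accuracy(table(GSS.A.comp$marital2, km.a$cluster))
```

Even under this more generous score, a fixed-assignment accuracy of 0.48 only becomes 0.52, still barely better than chance.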
As another check, the average silhouette widths for the hierarchical clusters were calculated, but they were only 0.44 and 0.42 (values close to 1 are ideal). The results for Ballots B and C were no more accurate, indicating that clustering is not an effective method for this analysis.
## Conclusion
In the end, we selected two different models as the best predictors. For predicting the likelihood of being married, the random forest model trained on imputed data for Ballot C was most accurate. For predicting the likelihood of being divorced, the decision tree model trained on imputed data for Ballot C was most accurate.
Our analysis has limitations in both its accuracy and its usefulness to policymakers or marriage counselors. Its accuracy is limited because we only use data from one half of each couple; our predictions would likely be much more accurate with data on both partners. Its usefulness may be limited in application to younger generations if their patterns of marriage and divorce differ significantly from those of the people in our dataset. Despite this, our analysis provides some insight into the characteristics related to an individual's likelihood of getting a divorce.