---
title: "EDAV HW1 Class Survey"
header-includes: \usepackage{graphicx}
author: "Team Excel"
date: "February 11, 2016"
output: 
  pdf_document: 
    number_sections: yes
---

```{r echo=FALSE, warning=FALSE,message=FALSE}
library(ggdendro)
library(scales)
library(corrplot)
library(png)
library(rCharts)
library(ggplot2)
library(ggthemes)
library(dplyr)
library(manipulate)
library(grid)
library(gridExtra)
library(tidyr)
library(lubridate)
library(reshape)
library(ggfortify)

source("utils.R")

survey = read.csv("Survey+Response.csv") %>% clean_data()
learn_interest = read.csv("learn_interest_google.csv")
genderTech <- read.csv('gender-tech.csv')


learn_interest$Week = ymd(unlist(lapply(as.character(learn_interest$Week), function(x) strsplit(x," - ")[[1]][1])))

learn_interest_long = learn_interest%>% gather( term, count,  starts_with("learn")) %>% 
  mutate(year = year(Week), month = month(Week)) %>% group_by(year, month, term) %>% 
  summarize(count = sum(count) ) %>% mutate(date = ymd(paste(year,month,"01",sep="-")))


genderTech <- genderTech[order(genderTech$percentWomen),]

genderTech <- melt(genderTech, id=c('Company'))

genderTech$Company <- factor(genderTech$Company, 
                             levels=rev(c('STAT 4701: EDAV','eBay', 'Apple', 'Pinterest', 'Google',
                                          'LinkedIn', 'Yahoo', 'Facebook', 'Twitter')))

```


# Introduction

A survey of the class asked nine questions about students' technical competency across a variety of data tools. This is Team Excel’s analysis of the 140 observations generated by the survey. We aim to provide insight into the data through visualization, and along the way we explain why we chose each analysis.


# Respondent Information

In this section we will provide an overview of the respondents based on gender and program.

The butterfly bar plot below compares the number of male and female students in each program. It also shows the total number of students from each program who participated in the survey.

```{r echo=FALSE, warning=FALSE, fig.width=8,fig.align='center', out.width='.9\\textwidth'}
class_dist = ggplot(data=survey,aes(x=program,fill=gender)) + 
  geom_bar(data = dplyr::filter(survey, gender=="Female")) + 
  geom_bar(data = dplyr::filter(survey, gender=="Male"), aes(y=..count..*(-1))) + 
  scale_y_continuous(breaks=seq(-40,40,10)) + 
  coord_flip()+
  theme_fivethirtyeight() + 
  scale_fill_tableau(name="gender") + 
  labs(title = "Distribution of the Class by Program and Gender")
class_dist
```
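
The butterfly layout above comes down to plotting one group's counts as negative values, so that its bars extend in the opposite direction. Below is a minimal sketch of that step in base R, using a made-up data frame; `toy`, `counts`, and `signed` are illustrative names and not part of our survey code.

```r
# Hypothetical respondents (not the survey data)
toy <- data.frame(
  program = c("A", "A", "A", "B", "B"),
  gender  = c("Female", "Male", "Male", "Female", "Male")
)

# One row per program/gender combination with its count in `Freq`
counts <- as.data.frame(table(toy))

# Mirror the male bars by negating their counts
counts$signed <- ifelse(counts$gender == "Male", -counts$Freq, counts$Freq)
```

Plotting `signed` as bar heights and flipping the coordinates would give the mirrored layout.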

The gender imbalance in our class (29% women, 71% men) is representative of a larger cultural issue. Below, we compare the gender diversity in our course with the tech workforces of several Internet giants that publicly released diversity reports in 2015.

```{r,echo=FALSE, warning=FALSE,fig.width=7,fig.align='center', out.width='.9\\textwidth'}
gender_tech = ggplot(data = genderTech, aes(x = Company, y = value, fill = variable)) + coord_flip() +
  geom_bar(stat = "identity") + 
  labs(title='Gender in Tech Workforce', x='', y='') +
  scale_y_continuous(labels=function(x) paste0(x,'%')) + 
  guides(fill=guide_legend(title=NULL)) +
  scale_fill_tableau(labels=c('Female', 'Male')) +
  theme_fivethirtyeight() 

gender_tech
```

Data Sources: [Apple](http://www.apple.com/diversity/), [Facebook](http://newsroom.fb.com/news/2014/06/building-a-more-diverse-facebook/), [Google](http://www.google.com/diversity/index.html), [Twitter](https://blog.twitter.com/2014/building-a-twitter-we-can-be-proud-of), [LinkedIn](http://blog.linkedin.com/2014/06/12/linkedins-workforce-diversity/), [Pinterest](https://engineering.pinterest.com/blog/diversity-and-inclusion-pinterest), [Yahoo](http://yahoo.tumblr.com/post/89085398949/workforce-diversity-at-yahoo), [eBay](https://www.ebayinc.com/stories/news/building-stronger-better-more-diverse-ebay/).

# Tools

Now we will explore tools used by the students by looking at distributions, relations, and comparisons to external data.  

This histogram shows the distribution of the number of tools known by the class.  The mean tool count is 7.605, with a standard deviation of 3.52, a minimum of 1 and a maximum of 16.  The data are spread fairly symmetrically around the mean, with a good deal of mass above the mean offset by a large spike at 6 tools: more than twice as many respondents know 6 tools as fall into the next most common bucket.

```{r,echo=FALSE, warning=FALSE,fig.width=7,fig.align='center', out.width='.9\\textwidth'}
tool_dist = ggplot(survey, aes(x = number_tools))  + geom_density( aes(y=..count..))  + 
  geom_bar(alpha = .7)+
  theme_fivethirtyeight() +  
  scale_fill_tableau() + 
  labs(title="# of Tools Distribution", x = "tools")

tool_dist
```
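
The summary statistics quoted above can be computed along the following lines. This is only a sketch on a made-up vector; `tools_known` stands in for the survey's tool-count column and the numbers are not our data.

```r
# Hypothetical tool counts for ten respondents (not the survey data)
tools_known <- c(6, 6, 6, 3, 9, 12, 6, 8, 10, 4)

# Mean, standard deviation, minimum and maximum in one named vector
tool_summary <- c(mean = mean(tools_known),
                  sd   = sd(tools_known),
                  min  = min(tools_known),
                  max  = max(tools_known))

# Respondents per bucket, largest first, to see how dominant the mode is
bucket_counts <- sort(table(tools_known), decreasing = TRUE)
modal_bucket  <- as.integer(names(bucket_counts)[1])
```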

This heatmap compares students across programs by the average number of tools they have learned, split out by gender. The QMSS students have experience with the highest number of tools, followed by the PhD students. Both figures are likely skewed by how few QMSS and PhD students there are in the class. The MS Stat students seem to know the fewest tools. There does not seem to be much variation in skill level between men and women.


```{r,echo=FALSE, warning=FALSE,fig.width=7,fig.align='center', out.width='.9\\textwidth'}
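# Average the tool counts within each gender/program cell first; drawing one
# tile per respondent would only show whichever tile happens to be drawn last.
tool_means = survey %>% group_by(gender, program) %>% 
  summarize(avg_tools = mean(number_tools))

ggplot(tool_means, aes(gender, program, fill = avg_tools)) + 
  geom_tile() +
  scale_fill_gradient2(high = "red",low = "white", name = "Avg. tools") + 
  theme_fivethirtyeight() + 
  labs(title = "Heat Map: Skill Count by Program")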
```

This violin plot compares the distribution of the number of tools known by respondents between programs, and within programs between genders.  We were interested in seeing whether the number of tools each respondent knew differed between programs.  From the plot, it appears that there is higher variance among male respondents, even though the means are similar across genders and programs.

```{r,echo=FALSE, warning=FALSE,fig.width=5,fig.align='center', out.width='.9\\textwidth'}
# Python code to generate the image
# import seaborn as sns
# palette = sns.color_palette(['#1b62a5', '#fc690f'])
# sns.set(rc={'axes.facecolor': '#ececec', 'figure.facecolor':'#ececec', 'font.font-size': 30})
# sns.violinplot(x="program", y="number_tools", hue="gender", 
#    data=survey[survey.gender.isin(['Male', 'Female'])], split=True, 
#    palette=palette, size=6).set_title('Number of Tools by Program and Gender')

plot.new()
violin =readPNG("violin.png")

lim <- par()
rasterImage(violin, lim$usr[1], lim$usr[3], lim$usr[2], lim$usr[4])
```


This colored matrix shows the positive or negative correlation between each pair of tools. Here are some of the correlations that can be observed:

* weak negative correlation between stata and github
* strong positive correlation between r and rstudio (of course)
* strong positive correlation between dropbox and googledrive
* strong positive correlation between shell and github, SQL, grep, XML, Web
* no correlation between r and python
* no correlation between r and excel
* no correlation between c and r

```{r,echo=FALSE, warning=FALSE,fig.width=9, fig.height=9,fig.align='center',out.width='.9\\textwidth'}
order = sort(apply(survey[,12:31],2,mean),decreasing = TRUE)

corrplot(cor(survey[,sort(names(order))]),method='square',diag = FALSE, mar = c(1,0,1,0),main = '          Tool Correlations')
```

The correlation matrix above displays the pairwise correlation of familiarity with each tool. A positive correlation between a pair indicates that an individual who knows one tool is more likely to know the second tool than the class as a whole. The matrix can help reveal which tools are often learned together or may rely on each other. For example, R and RStudio have the highest (positive) correlation, which is unsurprising since familiarity with RStudio suggests familiarity with R, and the two are typically used in combination. Other highly correlated tools are google drive and dropbox (both collaboration/cloud storage services), lattice and sweave (both R libraries), and shell and SQL. Looking at whole rows (or columns) reveals which tools tend to be known by individuals with a larger set of tools. For example, SQL and shell are positively correlated with multiple tools, suggesting that someone who knows SQL or shell probably knows more tools than the class average. Conversely, someone who knows Stata, SPSS, or MatLab probably does not know more tools than is typical. Overall, most correlations are positive, suggesting that knowing a specific tool makes you more likely to know a greater number of other tools.
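
To make the interpretation above concrete: for 0/1 familiarity indicators, the Pearson correlation is the phi coefficient, and it is positive exactly when knowing one tool raises the share of people who know the other above the class-wide share. The sketch below uses made-up indicators; `toy` and the derived names are illustrative and the values are not survey data.

```r
# Hypothetical 0/1 familiarity indicators for eight respondents
toy <- data.frame(
  r       = c(1, 1, 1, 1, 0, 0, 1, 0),
  rstudio = c(1, 1, 1, 0, 0, 0, 1, 0),
  stata   = c(0, 0, 1, 0, 1, 1, 0, 0)
)

# Pearson correlation between each pair of indicator columns
tool_cor <- cor(toy)

# Share of RStudio users in the whole class vs. among R users only
p_rstudio         <- mean(toy$rstudio)
p_rstudio_given_r <- mean(toy$rstudio[toy$r == 1])
```

Here `p_rstudio_given_r` exceeds `p_rstudio`, which is the same statement as `tool_cor["r", "rstudio"]` being positive.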


```{r,echo=FALSE, warning=FALSE,fig.width=10,fig.align='center',out.width='.9\\textwidth'}
tooldummies = survey[,12:31]
# Distances are computed between tools (columns), hence the transpose.
# The result is not called `dist`, to avoid shadowing the base function.
tool_distances <- dist(t(tooldummies))
clust <- hclust(tool_distances, method = 'average')
par(mar = c(1,0,2,1))
# plot(clust, main = 'Tool Cluster Dendrogram', axes = FALSE)
ggdendrogram(clust) + theme_fivethirtyeight() + labs(title = 'Tool Cluster Dendrogram') + theme(axis.text.y = element_blank(), axis.title = element_blank())
```


This dendrogram shows the agglomerative clustering, using average linkage, of the twenty tools. It reveals which tools are most similar in terms of the individuals who know them. The dendrogram includes some expected pairings based on the correlation table, with highly correlated pairs grouped next to each other, but it is surprising that the very first split separates R from the different R libraries. It seems that the first split may be a split between common and uncommon tools, since the six tools on the right branch are the six most commonly known tools in the survey.
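
One way to read the distances behind such a dendrogram: for 0/1 familiarity columns, the squared Euclidean distance between two tools equals the number of respondents who know exactly one of them. The sketch below uses made-up data and assumes Euclidean distance, the default of `dist()`; the matrix `toy` and its column names are illustrative.

```r
# Hypothetical 0/1 indicators: rows are respondents, columns are tools
toy <- cbind(a = c(1, 1, 1, 0, 0, 0),
             b = c(1, 1, 0, 0, 0, 0),
             c = c(0, 0, 0, 1, 1, 1))

# Distances between tools, so the matrix is transposed first
d <- dist(t(toy))

# Squared distance = number of respondents on whom two tools disagree
disagreements <- as.matrix(d)^2

# Average-linkage clustering merges the closest pair (a and b) first
hc <- hclust(d, method = "average")
```

This may also help explain a split between common and uncommon tools: a very common tool and a rare tool necessarily disagree on many respondents, so they sit far apart.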

```{r,echo=FALSE, warning=FALSE,fig.width=7,fig.height=7,fig.align='center', out.width='.7\\textwidth'}
par(mar = c(3,1,4,1))
# Row means of the 0/1 indicators are proportional to each respondent's tool count
toolcount = rowMeans(survey[,12:31])

# princomp() has no `scale` argument (that belongs to prcomp()), so this PCA is run
# on the covariance matrix; pass cor = TRUE to standardize the variables instead.
biplot(princomp(cbind(tooldummies,toolcount)),xlabs=rep("",nrow(tooldummies)), xlab = '', ylab = '',col = '#f8766d',xlim = c(-.21,.02),ylim = c(-.16,.11),main = 'Biplot from PCA of Tool Familiarity',bg = 'grey')
# pca_fit = princomp(cbind(tooldummies,toolcount))
# autoplot(pca_fit, loadings = TRUE, loadings.label = TRUE,xlim = c(-.5, .5), ylim = c(-.6, .6))

```


This biplot is formed from a principal components analysis of the twenty tools together with a total count of the tools. Most of the tools point toward either the top left or the bottom left, and the split is similar to the first split in the clustering, with Python being a noticeable exception. The first principal component seems to capture primarily overall tool expertise. This is supported by the fact that toolcount points directly to the left and most tools point into the left half, since knowing one tool means that you are more likely to know other tools. The exceptions are the same as in the correlation plot: Matlab, Stata, and SPSS, where familiarity does not make it more likely that a respondent knows other tools. SQL and shell have the most negative scores on the x-axis, which also agrees with the correlation plot in that knowing one of these tools suggests knowledge of a wider array of other tools.

The second principal component is more difficult to interpret. Tools with high scores on the second principal component (C, shell, and Matlab) are all programming or scripting languages that may be more common among students with an engineering background. The tools with the lowest scores (googledrive, dropbox, Excel) are not programming languages and are known by a more general group of people, with or without programming experience. In the middle are statistical languages and packages such as R, Stata, and ggplot2.


*To make this analysis more relevant to the overall field of Data Science, we have brought in data from Google search and information from a survey conducted by O'Reilly*. Below is a comparison of Google search trends for three terms: “learn python”, “learn R”, and “learn SQL”. The graph covers 2005 to January 2016, and the vertical axis is in units relative to the highest point in the graph. Of the three terms, only SQL has a negative trend, though it levels off and stays roughly constant towards the end. Python shows a steep, accelerating increase, whereas R shows a small, gradual one.


```{r echo=FALSE, warning=FALSE, fig.width=8,fig.align='center', out.width='.9\\textwidth', fig.cap="Data Source: Google Trends"}
google_search_trend = ggplot(filter(learn_interest_long,year>=2005)) + 
  geom_line(aes(x = date, y =count, color = term))+
  theme_fivethirtyeight() + 
  theme(axis.text.y  = element_blank(),axis.title.y  = element_blank())+
  scale_color_tableau() + 
  labs(title="Google search trends over time")

google_search_trend
```

O'Reilly hosted an online survey about Data Science which was open to their audience from November 2014 to July 2015. The survey had 820 respondents from 47 countries and 38 states, across multiple industries. One quarter of the respondents have job titles that fall under Data Science, and the rest of the sample consisted mostly of students, postdocs, professors, and consultants. The image below shows the distribution of responses to the question: Which of the following tools do you use?

```{r,echo=FALSE, warning=FALSE,fig.width=5,fig.align='center', out.width='.9\\textwidth', fig.cap="Data Source: O'Reilly 2015 Data Science Survey"}
plot.new()
# Image of the O'Reilly survey results (the file on disk is named oreally.png)
oreilly_img = readPNG("oreally.png")

lim <- par()
rasterImage(oreilly_img, lim$usr[1], lim$usr[3], lim$usr[2], lim$usr[4])
```


A similar question with fewer options was asked in our class survey and produced the following distribution. Although the two surveys differ in population size and experience levels, it is easy to see that tools such as R, Python and Excel are the most used by both populations.

```{r echo=FALSE, warning=FALSE, fig.width=8,fig.align='center', out.width='.9\\textwidth'}
# Proportion of respondents familiar with each tool, most common first
order = sort(apply(survey[,12:31],2,mean),decreasing = TRUE)
# Keep the tools in that order on the axis by fixing the factor levels
plotdata = data.frame(tool = factor(names(order), levels = names(order)),
                      proportion = as.numeric(order))

ggplot(plotdata,aes(tool,proportion)) + geom_bar(stat = 'identity') + theme_fivethirtyeight() +
  scale_color_tableau() + ggtitle('Survey Tool Familiarity') + xlab("") + ylab("")  +
  scale_y_continuous(labels=percent, limits = c(0,1)) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```


Below is a barplot of the number of students who use each text editor, colored to differentiate students from different programs. To achieve these groupings, we combined terms that referred to the same text editor. RStudio is the most frequently used text editor in our class, based on the survey results. This could be due to the fact that R is the main language used in this course. Often, a person’s preferred text editor depends on the task at hand: RStudio for R, iPython for Python, Sublime for JavaScript. RStudio and iPython are also computational environments, which fall into a different category than TextWrangler, Atom, Sublime, etc.


```{r,echo=FALSE, warning=FALSE, fig.width=8,fig.align='center', out.width='.9\\textwidth'}
ggplot(survey) + geom_bar(aes(x=reorder_size(primaryeditor,F), fill=program)) + 
  coord_flip() +
  theme_fivethirtyeight()+
  scale_fill_tableau()+
  labs(x="editor")
```

# Experience

In this section we will focus on observing the relationship between specific technical experience and the rest of the variables.

This series of butterfly bar plots gives a general idea of the number of male and female students who reported each level of experience across six different areas and tools.

```{r,echo=FALSE, warning=FALSE, fig.width=8,fig.align='center', out.width='.9\\textwidth'}
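# factor_key = TRUE keeps the gathered column names as a factor in column order,
# so the level labels assigned below line up with the right columns
experience = survey %>% gather( language,experience,  starts_with("exp."), factor_key = TRUE) %>% 
  mutate(experience = factor(experience)) 

levels(experience$experience) = c("None" ,"A little",  "Confident", "Expert")
levels(experience$language) = c("R Modeling", "R Graphics", "R advanced", 
                                "documentation", "Matlab", "Github")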

exp_gender = ggplot(experience,aes(x=experience, fill = gender)) + 
  geom_bar(data = dplyr::filter(experience, gender=="Female")) + 
  geom_bar(data = dplyr::filter(experience, gender=="Male"), aes(y=..count..*(-1))) + 
  # scale_y_continuous(breaks=seq(-40,40,10)) +
  # labs(title=var)+
  coord_flip()+
  theme_fivethirtyeight() + scale_fill_tableau(name = "gender")+ 
  facet_wrap(~language,ncol = 3) +
  labs(title="Experience by Gender")+
  theme(axis.title.y  = element_blank())
exp_gender
```


The panel of box plots below compares the confidence level in a specific skill (modeling, graphics, documentation, etc.) to the number of tools one has experience with. As the number of tools increases, the confidence level with respect to a skill also tends to increase. For example, students who claim to be “Expert” in R-modeling knew on average 12 total programs, while students who claimed to know “A little” about R-modeling averaged about 5 total programs. One exception is seen in the Github experience boxplot. Respondents who claimed to be experts in Github had not necessarily been exposed to a higher number of tools than respondents who claimed to be confident in Github.


```{r,echo=FALSE, warning=FALSE, fig.width=8,fig.align='center', out.width='.9\\textwidth'}
ggplot(experience) + geom_boxplot(aes(x = experience, y = experience_programming), fill = "lightgrey")  + 
  labs(title="Tool Counts vs Experience Level", y = "# of programs")+theme_fivethirtyeight()+
  facet_wrap(~language,ncol = 3)

```


This histogram plots the distribution of experience levels in the class.  We converted experience levels (None, A little, Confident, Expert) to values 0 - 3 for the experience-related columns so we could sum them and get a sense of the overall experience level of respondents.  Given there were six experience columns, the maximum possible ‘experience’ value is 18.  The mean for the class was 6.46, with a standard deviation of 3.37, a minimum of 0 and a maximum of 15.  The distribution is fairly symmetric around the mean, with a slightly longer upper tail, indicating that most respondents have some experience with analytical tools and techniques, but not extensive experience.

```{r,echo=FALSE, warning=FALSE,fig.width=7,fig.align='center', out.width='.9\\textwidth'}

exp_dist = ggplot(survey, aes(x = experience_programming))  + geom_density( aes(y=..count..))  + 
  geom_bar(alpha = .7) +
  theme_fivethirtyeight() +  
  scale_fill_tableau() + 
  labs(title="Experience Distribution", x = "experience")

exp_dist
```
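
The experience score described above can be sketched as follows: each label is mapped to its position in the ordered list of levels (0 to 3) and the values are summed across columns. The data frame `toy` and its three column names are made up for illustration; the survey itself has six experience columns.

```r
# Ordered experience labels; position minus one gives the 0-3 score
levels_in_order <- c("None", "A little", "Confident", "Expert")

# Hypothetical answers from three respondents (not the survey data)
toy <- data.frame(
  exp.modeling = c("None", "Expert", "A little"),
  exp.graphics = c("A little", "Confident", "A little"),
  exp.github   = c("None", "Expert", "Confident"),
  stringsAsFactors = FALSE
)

# Convert every column to numeric scores, then sum per respondent
scores <- sapply(toy, function(col) match(col, levels_in_order) - 1L)
experience_score <- rowSums(scores)

# Highest attainable score: "Expert" (3) in every column
max_possible <- 3 * ncol(toy)
```

With six columns, as in the survey, `max_possible` would be 18.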

This box and whisker plot also shows the distribution of the experience levels (as described above) within different programs.  Because some programs have fewer than five respondents, it’s difficult to make comparisons across groups.  That said, there are some interesting things that we noticed about the plot.  First, the Data Science programs have the widest ranges in experience levels.  Between them they contain both the lowest and the highest experience levels, though the certification program's mean is lower than the MS program's (and the lowest of all the programs).

```{r,echo=FALSE, warning=FALSE,fig.width=7,fig.align='center', out.width='.9\\textwidth'}
ggplot(survey)+ aes(y = experience_programming, x =program, fill = program) + geom_boxplot()+ 
    stat_summary(fun.data =give.n, geom = "text" ) +
    theme_fivethirtyeight() +  
    theme(legend.position="none")+
    scale_fill_tableau()+
    labs(title="Experience by program", y = "experience", x="program")

```


Like the boxplot above, this violin plot also compares distributions of experience levels between programs.  This plot, however, also compares experience levels within groups.  Further, instead of bounding the box at the 25th and 75th percentiles, the violin plot uses kernel density estimation to estimate the distribution of the experience data.  Distributions for the Female and Male categories are mirrored across the experience axis to show differences in distributions between those categories within each program.  This plot also provided some interesting insights: the Male category generally had greater variance in experience levels.  There also appeared to be a tendency towards bimodality in many of the distributions, with distinct higher-skill and lower-skill clusters visible in many groups.

```{r,echo=FALSE, warning=FALSE,fig.width=5,fig.align='center', out.width='.9\\textwidth'}
# Python code to generate the plot
# sns.violinplot(x="program", y="experience", hue="gender", 
#    data=survey[survey.gender.isin(['Male', 'Female'])], split=True, 
#    palette=palette, size=6).set_title('Experience Level by Program and Gender')

plot.new()
violin2 =readPNG("violin2.png")

lim <- par()
rasterImage(violin2, lim$usr[1], lim$usr[3], lim$usr[2], lim$usr[4])
```
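
The kernel density estimation behind a violin plot can be illustrated in base R with `density()`. The sketch below uses a made-up, deliberately two-peaked vector and a hand-picked bandwidth; the violin plot itself was produced by seaborn, whose defaults may differ, so this only illustrates the idea.

```r
# Hypothetical experience scores with a low group and a high group
scores <- c(2, 3, 3, 4, 4, 5, 9, 10, 10, 11)

# Gaussian kernel density estimate; bw = 1 is an arbitrary choice for this sketch
kde <- density(scores, bw = 1)

# Local maxima of the estimated curve: places where the slope turns from + to -
turning <- which(diff(sign(diff(kde$y))) == -2) + 1
peaks   <- kde$x[turning]
```

A distribution that looks bimodal in the violin plot would show up here as one entry of `peaks` in the low range and another in the high range.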

This scatter plot compares the number of tools each respondent indicated having experience with against their overall experience (tools, languages, etc.).  This plot layers a linear regression line on top of the scatter plot, along with error bands, to show the linear relationship.  It’s immediately apparent from the linear model and error bands that there’s a strong positive relationship between the two variables.

```{r,echo=FALSE, warning=FALSE,fig.width=7,fig.align='center', out.width='.9\\textwidth'}
ggplot(survey) + aes(y = experience_programming, x = number_tools) +
  geom_point()+ geom_smooth(method = "lm")+ 
  theme_fivethirtyeight() +  
  scale_fill_tableau() + 
  labs(title="experience vs tools", x = "tools", y="experience")

```
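
The strength of a relationship like the one above can be quantified with the fitted slope, its confidence interval, and the correlation coefficient. The sketch below is run on made-up numbers; `toy` and its columns are illustrative stand-ins for the survey variables.

```r
# Hypothetical tool counts and experience scores for six respondents
toy <- data.frame(number_tools = c(2, 4, 5, 7, 9, 12),
                  experience   = c(1, 3, 4, 6, 7, 11))

# Ordinary least squares fit, the same model geom_smooth(method = "lm") draws
fit <- lm(experience ~ number_tools, data = toy)

slope    <- coef(fit)[["number_tools"]]        # change in experience per extra tool
slope_ci <- confint(fit)["number_tools", ]     # 95% confidence interval for the slope
r        <- cor(toy$number_tools, toy$experience)
```

A confidence interval for the slope that excludes zero, together with a correlation close to 1, would support calling the relationship strong and positive.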

# Conclusion