diff --git a/PUBLICITY.md b/PUBLICITY.md new file mode 100644 index 0000000..8a15c07 --- /dev/null +++ b/PUBLICITY.md @@ -0,0 +1,55 @@ +# Facetweet announcement + +Learn how to analyze your datasets in R! [insert link here](https://youtu.be/dQw4w9WgXcQ) + +# Information for calendar + +The workshop duration is 3hrs per class. + +# Descriptions for website + +## Header + +**title** : R for Data Science + +**description** : The R for Data Science workshop series is a four part course, designed to take novices in the R language for statistical computing and produce programmers who are competent in finding, displaying, analyzing, and publishing data in R. + +## Part 1 + +**subtitle** : Basics of R + +**description** : Students will understand the motivation behind object orientation, and how that relates to computation. Students will be able to perform basic functions in R necessary to use the software on their computers and conduct basic arithmetic. Students will understand data types and data structures, and why and how they are different from each other. + +**knowledge requirements** : [Programming Fun!damentals](https://github.com/dlab-berkeley/programming-fundamentals), or equivalent prior knowledge + +**tech requirements** : Laptop required; please install R version 3.2 or greater in advance (University laptops will need to have R installed by an administrator); the RStudio IDE is recommended but not required + +## Part 2 + +**subtitle** : Clean and tidy data + +**description** : Students will be introduced to DRY principles and best practices for sanitizing and tidying data. Students will learn what missingness is, and how best to accommodate missing data in their research designs. Students will be able to read in files from disk or a database, clean the data found within them, select specific data from them, and merge them with other datasets. 
+ +**knowledge requirements** : R-for-Data-Science Part 1 or equivalent prior knowledge + +**tech requirements** : Laptop required; please install R version 3.2 or greater in advance (University laptops will need to have R installed by an administrator); the RStudio IDE is recommended but not required + +## Part 3 + +**subtitle** : Analyzing data + +**description** : Students will be introduced to the principles behind the grammar of graphics and the general linear model. Students will understand the implementation of plotting in R. Students will be able to explore, summarize, and analyze data using R's implementation of exploratory and inferential data analysis. + +**knowledge requirements** : R-for-Data-Science Part 2 or equivalent prior knowledge + +**tech requirements** : Laptop required; please install R version 3.2 or greater in advance (University laptops will need to have R installed by an administrator); the RStudio IDE is recommended but not required + +## Part 4 + +**subtitle** : Functions and packages + +**description** : Students will be introduced to the principles behind functional programming. Students will learn how to write and import functions, add looped and vectorized computation to their functions, and control the flow of data through a function. Students will understand the basics of name spaces, and how that relates to assigning values within functions. Students will see how to successfully package a function for CRAN. 
+ +**knowledge requirements** : R-for-Data-Science Part 2 or equivalent prior knowledge + +**tech requirements** : Laptop required; please install R version 3.2 or greater in advance (University laptops will need to have R installed by an administrator); the RStudio IDE is recommended but not required diff --git a/README.md b/README.md index 3c4651e..b46ecc0 100644 --- a/README.md +++ b/README.md @@ -19,7 +19,7 @@ The instructor of this workshop series will lead you through the activities for ## If you are a D-Lab instructor -You'll see accumulated teaching notes and examples for each day's topics in the instructor folder. For your convenience, these are available as .Rmd, commented .R files, PDF documents, and HTML slides. +You'll see accumulated teaching notes and examples for each day's topics in the instructor folder. For your convenience, these are available as .Rmd, commented .R files, PDF documents, and HTML slides. The meta-document for this workshop series, which explains the logic behind the structure and topics, can be viewed [at the D-Lab guides repository](https://github.com/dlab-berkeley/guides/blob/master/r.pdf) For information on contributing to this repository, see `CONTRIBUTING.md` @@ -61,17 +61,17 @@ This workshop series covers: This workshop uses the following packages: -1. Amelia -2. devtools -3. dplyr -4. foreign -5. ggplot2 -6. parallelMap -7. RCurl -8. reshape2 -9. roxygen2 -10. stringr -11. XML +* Amelia +* devtools +* dplyr +* foreign +* ggplot2 +* parallelMap +* RCurl +* roxygen2 +* stringr +* tidyr +* XML --- _D-Lab == Data Intensive Social Science, For All!_ diff --git a/data/dirty.csv b/data/dirty.csv index 92aa05e..b7d2c22 100644 --- a/data/dirty.csv +++ b/data/dirty.csv @@ -1,6 +1,6 @@ Timestamp,How tall are you?,What department are you in?,Are you currently enrolled?,What is your birth order? 
7/25/2015 10:08:41,very,Geology ,Yes,1 7/25/2015 10:10:56,70,999,Yes,1 -7/25/2015 10:11:20,5’9, geology,999,2 +7/25/2015 10:11:20,5'9, geology,999,2 7/25/2015 10:11:25,2.1,goelogy,No,"9,000" -7/25/2015 10:11:29,156,anthro,999,2 \ No newline at end of file +7/25/2015 10:11:29,156,anthro,999,2 diff --git a/instructor/day_four.html b/instructor/day_four.html index 4576a9c..d3c37e2 100644 --- a/instructor/day_four.html +++ b/instructor/day_four.html @@ -55,7 +55,7 @@

Day Four: Functional Programming

Dillon Niederhut
Shinhye Choi

-

02 May, 2016

+

24 May, 2016

@@ -83,7 +83,7 @@

Looping

} # or mat <- c(rep(NA, 6)) -for(i in 5:10){ +for(i in 5:10){ mat[i-4] <- 2^i } # by setting sequence and statement accordingly
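Assembled into one runnable chunk, the preallocate-then-fill pattern from the slide looks like this:

```r
mat <- rep(NA, 6)      # preallocate the result vector
for (i in 5:10) {
  mat[i - 4] <- 2^i    # shift the index so results land in positions 1..6
}
mat                    # 32 64 128 256 512 1024
```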

You can also loop over a non-numeric vector

@@ -93,11 +93,11 @@

Looping

for(city in c("Berkeley", "Walnut Creek", "Richmond")){ if(sum(city==city.temp$a)>0){ - print(city.temp[which(city==city.temp$a),]) + print(city.temp[which(city==city.temp$a),]) # if we have the city in our data, print its temperature and the name of the city } if(sum(city==city.temp$a)==0){ - print(paste(city, "is NOT in the data. :(", sep=" ")) + print(paste(city, "is NOT in the data. :(", sep=" ")) # if not, just print the name of the city followed by "is NOT in the data. :(" } } # Loops can be as complicated and long as needed; they are often not very efficient. @@ -114,7 +114,7 @@

Looping

system.time( for(i in 1:1000){ print(i) - if(i == 50) break + if(i == 50) break })

Next we move on to control structures, such as if statements. "If" statements are very useful when you want to assign different tasks to different subsets of data using a single for-loop. The basic syntax looks like the following: if(condition){statement} else{other statement}

@@ -123,7 +123,7 @@

Looping

x <- 7
 if(x > 10){
   print(x)
-  
+
   }else{                     # "else" should not start its own line. 
                              # Always let it be preceded by a closing brace on the same line.
   print("NOT BIG ENOUGH!!")
@@ -137,15 +137,15 @@ 

Looping

## [8] "male" "male" "male" "female" "female" "male" "female" ## [15] "male" "female" "female" "male" "male" "female" "female" ## [22] "male" "male" "male" "male" "female" "male" "female" -## [29] "female" "male" "male" "female" "female" "female" "male" -## [36] "female" "female" "female" "female" "female" "male" "male" -## [43] "male" "female" "male" "female" "female" "male" "male" +## [29] "female" "male" "male" "female" "female" "female" "male" +## [36] "female" "female" "female" "female" "female" "male" "male" +## [43] "male" "female" "male" "female" "female" "male" "male" ## [50] "male" "male" "male" "male" "female" "female" "female" -## [57] "male" "female" "male" "female" "male" "female" "male" -## [64] "female" "female" "female" "female" "male" "male" "male" +## [57] "male" "female" "male" "female" "male" "female" "male" +## [64] "female" "female" "female" "female" "male" "male" "male" ## [71] "female" "male" "female" "male" "female" "female" "female" -## [78] "female" "male" "male" "male" "male" "female" "male" -## [85] "female" "male" "female" "male" "female" "female" "male" +## [78] "female" "male" "male" "male" "male" "female" "male" +## [85] "female" "male" "female" "male" "female" "female" "male" ## [92] "male" "male" "male" "female" "female" "female" "female" ## [99] "female" "male"
gender <- ifelse(gender=="male", 1, 0)
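ifelse is the vectorized cousin of if/else: the condition is evaluated element-wise, so one call recodes a whole vector (a sketch with a toy vector):

```r
gender <- c("male", "female", "female", "male")
ifelse(gender == "male", 1, 0)   # 1 0 0 1
```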
@@ -175,7 +175,7 @@ 

every function has three parts

body(f)
## x + 1
environment(f)
-
## <environment: 0x7f81d9374c60>
+
## <environment: 0x7fe163bba308>

environments are where the function was defined

see how our function has R_GlobalEnv as its environment? that’s because we defined it in the global environment

this means that if you tell a function to look for an object, it will look in the global namespace
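That lookup rule is easy to demonstrate: a free variable inside a function is resolved in the environment where the function was defined, here the global one.

```r
y <- 10
f <- function(x) x + y   # y is not an argument, so R searches f's
                         # defining environment (the global one)
f(1)                     # 11
```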

@@ -260,13 +260,13 @@

the right way to be functional

lapply(heights, in_to_m)
## [[1]]
 ## [1] 1.7526
-## 
+##
 ## [[2]]
 ## [1] 1.3716
-## 
+##
 ## [[3]]
 ## [1] 1.8542
-## 
+##
 ## [[4]]
 ## [1] 2.0828

it’s not always smart to name functions

@@ -274,13 +274,13 @@

it’s not always smart to name
lapply(heights, FUN = function(x) x %/% 12)
## [[1]]
 ## [1] 5
-## 
+##
 ## [[2]]
 ## [1] 4
-## 
+##
 ## [[3]]
 ## [1] 6
-## 
+##
 ## [[4]]
 ## [1] 6

lapply has limits

@@ -293,10 +293,10 @@

lapply has limits

lapply(dat, mean)
## $a
 ## [1] NA
-## 
+##
 ## $b
 ## [1] NA
-## 
+##
 ## $c
 ## [1] NA

we know there are numbers there - why are the means all missing?
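The culprit is that mean() returns NA whenever its input contains a missing value; extra arguments to lapply are passed straight through to the applied function, so na.rm = TRUE repairs the summaries (a sketch with a toy data frame standing in for dat):

```r
dat <- data.frame(a = c(1, 2, NA), b = c(NA, 4, 6))
lapply(dat, mean)                 # both means are NA
lapply(dat, mean, na.rm = TRUE)   # a = 1.5, b = 5
```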

@@ -315,16 +315,16 @@

this can be parallelized

side note - previous versions of these materials imported the parallel library, which is no longer supported as of R versions >= 3.2

install.packages('parallelMap')
-
## 
+
##
 ## The downloaded binary packages are in
-##  /var/folders/rj/8gpcssqd52z9yrqw7f8xxfym0000gn/T//Rtmp2xjYZ7/downloaded_packages
+ /var/folders/rj/8gpcssqd52z9yrqw7f8xxfym0000gn/T//RtmpmP1txl/downloaded_packages
library(parallelMap)
 system.time(Map(median, dat, na.rm=TRUE))
-
##    user  system elapsed 
+
##    user  system elapsed
 ##   0.000   0.000   0.001
system.time(parallelMap(median, dat, na.rm=TRUE))
-
##    user  system elapsed 
-##   0.001   0.000   0.001
+
##    user  system elapsed
+##       0       0       0

parallel processing incurs time costs from memory management and message passing that can make small jobs take longer in parallel than in serial
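For completeness, a typical parallelMap call is bracketed by explicitly starting and stopping a worker pool; on a toy job like this the startup overhead alone usually exceeds the serial runtime (a sketch, assuming the parallelMap package is installed):

```r
library(parallelMap)
parallelStartSocket(cpus = 2)             # launch two local worker processes
res <- parallelMap(function(x) x^2, 1:4)  # a list of the squares 1, 4, 9, 16
parallelStop()                            # always release the workers
```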

@@ -372,7 +372,7 @@

adding dependencies

what if there are other packages that your package uses? like ggplot2? do

Imports: ggplot2

and if you want to list optional packages, you can do so like this:

-
Suggests: 
+
Suggests:
   reshape2 (>=1.4.1)
   plyr (>=1.8.3)
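Taken together, the dependency fields of a package's DESCRIPTION file look roughly like this (a sketch; the convertR package name is an assumption borrowed from the directory listing earlier in these materials):

```
Package: convertR
Imports:
    ggplot2
Suggests:
    reshape2 (>= 1.4.1),
    plyr (>= 1.8.3)
```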
@@ -396,19 +396,19 @@

creating man pages

library(roxygen2)

now we’re going to add specialized comments to our length.R file

#' Converts inches to centimeters
-#' 
+#'
 #' @param x A numeric
 #' @return Converted numeric
-#' @examples 
+#' @examples
 #' in_to_cm(1)
 #' in_to_cm(c(1,2,3))
 in_to_cm <- function(x) x * 2.54
 
 #' Converts inches to meters
-#' 
+#'
 #' @param x A numeric
 #' @return Converted numeric
-#' @examples 
+#' @examples
 #' in_to_m(1)
 #' in_to_m(c(1,2,3))
 in_to_m <- function(x){
@@ -423,20 +423,20 @@ 

NAMESPACE

export(in_to_m)

or you can have roxygen2 handle it for you by adding #' @export in the function blocks you want to have exported

#' Converts inches to centimeters
-#' 
+#'
 #' @param x A numeric
 #' @return Converted numeric
-#' @examples 
+#' @examples
 #' in_to_cm(1)
 #' in_to_cm(c(1,2,3))
 #' @export
 in_to_cm <- function(x) x * 2.54
 
 #' Converts inches to meters
-#' 
+#'
 #' @param x A numeric
 #' @return Converted numeric
-#' @examples 
+#' @examples
 #' in_to_m(1)
 #' in_to_m(c(1,2,3))
 #' @export
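Once the #' blocks and @export tags are in place, roxygen2 turns them into man pages and NAMESPACE entries; a minimal sketch of that step, assuming the package source sits in a convertR/ directory:

```r
library(roxygen2)
roxygenise("convertR")   # writes man/*.Rd and NAMESPACE from the #' comments
# devtools::document("convertR") is an equivalent wrapper
```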
@@ -507,31 +507,31 @@ 

Example code

## [1] "<H3>ACT I</h3>"   "<H3>ACT II</h3>"  "<H3>ACT III</h3>"
 ## [4] "<H3>ACT IV</h3>"  "<H3>ACT V</h3>"
RJ[grep("<h3>", RJ, perl=TRUE)]
-
##  [1] "<h3>PROLOGUE</h3>"                                                        
-##  [2] "<h3>SCENE I. Verona. A public place.</h3>"                                
-##  [3] "<h3>SCENE II. A street.</h3>"                                             
-##  [4] "<h3>SCENE III. A room in Capulet's house.</h3>"                           
-##  [5] "<h3>SCENE IV. A street.</h3>"                                             
-##  [6] "<h3>SCENE V. A hall in Capulet's house.</h3>"                             
-##  [7] "<h3>PROLOGUE</h3>"                                                        
-##  [8] "<h3>SCENE I. A lane by the wall of Capulet's orchard.</h3>"               
-##  [9] "<h3>SCENE II. Capulet's orchard.</h3>"                                    
-## [10] "<h3>SCENE III. Friar Laurence's cell.</h3>"                               
-## [11] "<h3>SCENE IV. A street.</h3>"                                             
-## [12] "<h3>SCENE V. Capulet's orchard.</h3>"                                     
-## [13] "<h3>SCENE VI. Friar Laurence's cell.</h3>"                                
-## [14] "<h3>SCENE I. A public place.</h3>"                                        
-## [15] "<h3>SCENE II. Capulet's orchard.</h3>"                                    
-## [16] "<h3>SCENE III. Friar Laurence's cell.</h3>"                               
-## [17] "<h3>SCENE IV. A room in Capulet's house.</h3>"                            
-## [18] "<h3>SCENE V. Capulet's orchard.</h3>"                                     
-## [19] "<h3>SCENE I. Friar Laurence's cell.</h3>"                                 
-## [20] "<h3>SCENE II. Hall in Capulet's house.</h3>"                              
-## [21] "<h3>SCENE III. Juliet's chamber.</h3>"                                    
-## [22] "<h3>SCENE IV. Hall in Capulet's house.</h3>"                              
-## [23] "<h3>SCENE V. Juliet's chamber.</h3>"                                      
-## [24] "<h3>SCENE I. Mantua. A street.</h3>"                                      
-## [25] "<h3>SCENE II. Friar Laurence's cell.</h3>"                                
+
##  [1] "<h3>PROLOGUE</h3>"
+##  [2] "<h3>SCENE I. Verona. A public place.</h3>"
+##  [3] "<h3>SCENE II. A street.</h3>"
+##  [4] "<h3>SCENE III. A room in Capulet's house.</h3>"
+##  [5] "<h3>SCENE IV. A street.</h3>"
+##  [6] "<h3>SCENE V. A hall in Capulet's house.</h3>"
+##  [7] "<h3>PROLOGUE</h3>"
+##  [8] "<h3>SCENE I. A lane by the wall of Capulet's orchard.</h3>"
+##  [9] "<h3>SCENE II. Capulet's orchard.</h3>"
+## [10] "<h3>SCENE III. Friar Laurence's cell.</h3>"
+## [11] "<h3>SCENE IV. A street.</h3>"
+## [12] "<h3>SCENE V. Capulet's orchard.</h3>"
+## [13] "<h3>SCENE VI. Friar Laurence's cell.</h3>"
+## [14] "<h3>SCENE I. A public place.</h3>"
+## [15] "<h3>SCENE II. Capulet's orchard.</h3>"
+## [16] "<h3>SCENE III. Friar Laurence's cell.</h3>"
+## [17] "<h3>SCENE IV. A room in Capulet's house.</h3>"
+## [18] "<h3>SCENE V. Capulet's orchard.</h3>"
+## [19] "<h3>SCENE I. Friar Laurence's cell.</h3>"
+## [20] "<h3>SCENE II. Hall in Capulet's house.</h3>"
+## [21] "<h3>SCENE III. Juliet's chamber.</h3>"
+## [22] "<h3>SCENE IV. Hall in Capulet's house.</h3>"
+## [23] "<h3>SCENE V. Juliet's chamber.</h3>"
+## [24] "<h3>SCENE I. Mantua. A street.</h3>"
+## [25] "<h3>SCENE II. Friar Laurence's cell.</h3>"
 ## [26] "<h3>SCENE III. A churchyard; in it a tomb belonging to the Capulets.</h3>"
Now that we know that the first line of each act begins with the string ``

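grep is case-sensitive by default, which is why matching the upper-case <H3> act headings and the lower-case <h3> scene headings took two separate calls; ignore.case = TRUE catches both at once (a sketch using a toy vector in place of RJ):

```r
RJ <- c("<H3>ACT I</h3>", "<h3>PROLOGUE</h3>", "Enter ROMEO")
RJ[grep("<h3>", RJ, ignore.case = TRUE)]   # both heading lines
```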
@@ -548,41 +548,41 @@

}

How do we count the number of times each name appears in each act? Create a wrapper function that counts the matches and returns the counts.

countR <- function(z){
-  return(c(length(grep("Romeo", z, perl=T)), length(grep("Juliet", z, perl=T)))) 
+  return(c(length(grep("Romeo", z, perl=T)), length(grep("Juliet", z, perl=T))))
 }
 lapply(x, countR)
## [[1]]
 ## [1] 8 4
-## 
+##
 ## [[2]]
 ## [1] 30  3
-## 
+##
 ## [[3]]
 ## [1] 54 13
-## 
+##
 ## [[4]]
 ## [1] 9 8
-## 
+##
 ## [[5]]
 ## [1] 20 19

Now count the lines in each act

# now count the lines in each act
 countL <- function(z){
-  return(length(grep("</A><br>$", z, perl=T))) 
+  return(length(grep("</A><br>$", z, perl=T)))
 }
 lapply(x, countL)
## [[1]]
 ## [1] 739
-## 
+##
 ## [[2]]
 ## [1] 685
-## 
+##
 ## [[3]]
 ## [1] 821
-## 
+##
 ## [[4]]
 ## [1] 407
-## 
+##
 ## [[5]]
 ## [1] 441
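lapply always returns a list; when each element's count is a single number, sapply gives the same answers simplified to a named vector (a sketch with a toy list standing in for x):

```r
x <- list(act1 = c("a </A><br>", "b"), act2 = c("c </A><br>", "d </A><br>"))
countL <- function(z) length(grep("</A><br>$", z, perl = TRUE))
sapply(x, countL)   # act1 = 1, act2 = 2
```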
diff --git a/instructor/day_one.html b/instructor/day_one.html index f2fd818..c80ac3d 100644 --- a/instructor/day_one.html +++ b/instructor/day_one.html @@ -54,7 +54,7 @@

Day One: R Basics

Dillon Niederhut

-

02 May, 2016

+

24 May, 2016

@@ -96,23 +96,23 @@

Object Oriented Programming

everything in R is an object

yes, even the commands, just watch

ls
-
## function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE, 
-##     pattern, sorted = TRUE) 
+
## function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE,
+##     pattern, sorted = TRUE)
 ## {
 ##     if (!missing(name)) {
 ##         pos <- tryCatch(name, error = function(e) e)
 ##         if (inherits(pos, "error")) {
 ##             name <- substitute(name)
-##             if (!is.character(name)) 
+##             if (!is.character(name))
 ##                 name <- deparse(name)
-##             warning(gettextf("%s converted to character string", 
+##             warning(gettextf("%s converted to character string",
 ##                 sQuote(name)), domain = NA)
 ##             pos <- name
 ##         }
 ##     }
 ##     all.names <- .Internal(ls(envir, all.names, sorted))
 ##     if (!missing(pattern)) {
-##         if ((ll <- length(grep("[", pattern, fixed = TRUE))) && 
+##         if ((ll <- length(grep("[", pattern, fixed = TRUE))) &&
 ##             ll != length(grep("]", pattern, fixed = TRUE))) {
 ##             if (pattern == "[") {
 ##                 pattern <- "\\["
@@ -127,7 +127,7 @@ 

everything in R is an object

## } ## else all.names ## } -## <bytecode: 0x7f81da153fb8> +## <bytecode: 0x7fe163574678> ## <environment: namespace:base>

ls, like basketball, is a specific thing with a name and stuff inside it that makes it ls and not dillon niederhut. in this particular instance, we are looking at the function that tells you what objects are in your environment

until we get to functional programming, your environment is just R plus whatever you put in R

@@ -135,19 +135,23 @@

in R, you store o

just like you need names to tell things apart, R does too

my.name <- dir
 my.name
-
## function (path = ".", pattern = NULL, all.files = FALSE, full.names = FALSE, 
-##     recursive = FALSE, ignore.case = FALSE, include.dirs = FALSE, 
-##     no.. = FALSE) 
-## .Internal(list.files(path, pattern, all.files, full.names, recursive, 
+
## function (path = ".", pattern = NULL, all.files = FALSE, full.names = FALSE,
+##     recursive = FALSE, ignore.case = FALSE, include.dirs = FALSE,
+##     no.. = FALSE)
+## .Internal(list.files(path, pattern, all.files, full.names, recursive,
 ##     ignore.case, include.dirs, no..))
-## <bytecode: 0x7f81dca955f0>
+## <bytecode: 0x7fe164e12268>
 ## <environment: namespace:base>

names must be unique

every time you give an object a name, it removes anything that already had that name from your environment
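A minimal illustration of that overwrite:

```r
x <- 1
x <- "one"   # rebinding the name discards the numeric 1
x            # [1] "one"
```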

my.name <- dir()
 my.name
-
## [1] "CONTRIBUTING.md" "data"            "examples"        "instructor"     
-## [5] "LICENSE"         "README.md"       "scripts"
+
##  [1] "CONTRIBUTING.md"            "convertR"
+##  [3] "convertR_0.0.0.9000.tar.gz" "data"
+##  [5] "examples"                   "instructor"
+##  [7] "LICENSE"                    "PUBLICITY.md"
+##  [9] "R-intensive.Rproj"          "README.md"
+## [11] "scripts"

you see those parentheses? that means you are calling an object (here, it’s a function evaluator) on dir.

classes in R

because it is code to be evaluated, dir belongs in a class called ‘functions’

@@ -182,19 +186,23 @@

tell R where you would like i
setwd("/Users/dillonniederhut/Dropbox/dlab/R-for-Data-Science")

find out what’s in your directory with

dir()
-
## [1] "CONTRIBUTING.md" "data"            "examples"        "instructor"     
-## [5] "LICENSE"         "README.md"       "scripts"
+
##  [1] "CONTRIBUTING.md"            "convertR"
+##  [3] "convertR_0.0.0.9000.tar.gz" "data"
+##  [5] "examples"                   "instructor"
+##  [7] "LICENSE"                    "PUBLICITY.md"
+##  [9] "R-intensive.Rproj"          "README.md"
+## [11] "scripts"

find out what’s in your environment with

in R, you are always in an environment (more on scoping in day 4)

ls()
-
##  [1] "document"     "my.character" "my.data"      "my.date"     
-##  [5] "my.factor"    "my.list"      "my.name"      "my.vector"   
+
##  [1] "document"     "my.character" "my.data"      "my.date"
+##  [5] "my.factor"    "my.list"      "my.name"      "my.vector"
 ##  [9] "test"         "your.vector"

add an object to your environment with

test <- "I have no idea what I'm doing"
 ls()
-
##  [1] "document"     "my.character" "my.data"      "my.date"     
-##  [5] "my.factor"    "my.list"      "my.name"      "my.vector"   
+
##  [1] "document"     "my.character" "my.data"      "my.date"
+##  [5] "my.factor"    "my.list"      "my.name"      "my.vector"
 ##  [9] "test"         "your.vector"

we can clean our environment with

rm(list = ls())
@@ -205,53 +213,53 @@ 

and search the help pages with ??<
??exists

you can get a quick example with

example(exists)
-
## 
+
##
 ## exists> ##  Define a substitute function if necessary:
 ## exists> if(!exists("some.fun", mode = "function"))
 ## exists+   some.fun <- function(x) { cat("some.fun(x)\n"); x }
-## 
+##
 ## exists> search()
-##  [1] ".GlobalEnv"          "package:Amelia"      "package:Rcpp"       
+##  [1] ".GlobalEnv"          "package:Amelia"      "package:Rcpp"
 ##  [4] "package:roxygen2"    "package:devtools"    "package:parallelMap"
-##  [7] "package:rmarkdown"   "package:knitr"       "package:stats"      
-## [10] "package:graphics"    "package:grDevices"   "package:utils"      
-## [13] "package:datasets"    "package:methods"     "Autoloads"          
-## [16] "package:base"       
-## 
+##  [7] "package:rmarkdown"   "package:knitr"       "package:stats"
+## [10] "package:graphics"    "package:grDevices"   "package:utils"
+## [13] "package:datasets"    "package:methods"     "Autoloads"
+## [16] "package:base"
+##
 ## exists> exists("ls", 2) # true even though ls is in pos = 3
 ## [1] TRUE
-## 
+##
 ## exists> exists("ls", 2, inherits = FALSE) # false
 ## [1] FALSE
-## 
+##
 ## exists> ## These are true (in most circumstances):
 ## exists> identical(ls,   get0("ls"))
 ## [1] TRUE
-## 
+##
 ## exists> identical(NULL, get0(".foo.bar.")) # default ifnotfound = NULL (!)
 ## [1] TRUE
-## 
-## exists> ## Don't show: 
+##
+## exists> ## Don't show:
 ## exists> stopifnot(identical(ls, get0("ls")),
 ## exists+           is.null(get0(".foo.bar.")))
-## 
+##
 ## exists> ## End(Don't show)
-## exists> 
-## exists> 
+## exists>
+## exists>
 ## exists>

when you kind of remember what you are looking for, try

apropos('lm')
-
##  [1] ".__C__anova.glm"      ".__C__anova.glm.null" ".__C__glm"           
-##  [4] ".__C__glm.null"       ".__C__lm"             ".__C__mlm"           
-##  [7] ".__C__optionalMethod" ".colMeans"            ".lm.fit"             
-## [10] "colMeans"             "confint.lm"           "contr.helmert"       
-## [13] "dummy.coef.lm"        "getAllMethods"        "glm"                 
-## [16] "glm.control"          "glm.fit"              "KalmanForecast"      
-## [19] "KalmanLike"           "KalmanRun"            "KalmanSmooth"        
-## [22] "kappa.lm"             "lm"                   "lm.fit"              
-## [25] "lm.influence"         "lm.wfit"              "model.matrix.lm"     
-## [28] "nlm"                  "nlminb"               "parallelMap"         
-## [31] "predict.glm"          "predict.lm"           "residuals.glm"       
+
##  [1] ".__C__anova.glm"      ".__C__anova.glm.null" ".__C__glm"
+##  [4] ".__C__glm.null"       ".__C__lm"             ".__C__mlm"
+##  [7] ".__C__optionalMethod" ".colMeans"            ".lm.fit"
+## [10] "colMeans"             "confint.lm"           "contr.helmert"
+## [13] "dummy.coef.lm"        "getAllMethods"        "glm"
+## [16] "glm.control"          "glm.fit"              "KalmanForecast"
+## [19] "KalmanLike"           "KalmanRun"            "KalmanSmooth"
+## [22] "kappa.lm"             "lm"                   "lm.fit"
+## [25] "lm.influence"         "lm.wfit"              "model.matrix.lm"
+## [28] "nlm"                  "nlminb"               "parallelMap"
+## [31] "predict.glm"          "predict.lm"           "residuals.glm"
 ## [34] "residuals.lm"         "summary.glm"          "summary.lm"

@@ -418,8 +426,8 @@

try giving your factor explicitly numeric levels and character labels

-
my.factor <- factor(c(1,2,3,4), 
-                    levels=c(1,2,3,4), 
+
my.factor <- factor(c(1,2,3,4),
+                    levels=c(1,2,3,4),
                     labels=c('undergraduate','graduate','professor','staff'))
 levels(my.factor)
## [1] "undergraduate" "graduate"      "professor"     "staff"
@@ -480,10 +488,10 @@

a li my.list

## [[1]]
 ## [1] TRUE
-## 
+##
 ## [[2]]
 ## [1] "two"
-## 
+##
 ## [[3]]
 ## [1] 3

you can find out the attributes of a list, and the types of data it contains, with

diff --git a/instructor/day_three.R b/instructor/day_three.R index 7a4c6db..11a84bd 100644 --- a/instructor/day_three.R +++ b/instructor/day_three.R @@ -10,20 +10,28 @@ summary(dat) table(dat$department) ## ------------------------------------------------------------------------ -dat$wday <- factor(weekdays(dat$timestamp, abbreviate = TRUE), - levels = c('Mon','Tue','Wed','Thu','Fri','Sat','Sun') - ) -summary(dat$wday) +library(psych) +describe(dat) + +## ------------------------------------------------------------------------ +library(dplyr) +dat %>% group_by(gender) %>% summarize(n()) ## ------------------------------------------------------------------------ -library(reshape2) -dcast(dat[dat$gender == 'Female/Woman' | dat$gender == 'Male/Man',], department ~ gender) -dcast(melt(dat, measure.vars = c('course.delivered')), wday ~ 'Delivered', fun.aggregate = mean) +library(tidyr) +dat %>% filter(!is.na(gender)) %>% group_by(gender, department) %>% + summarize(n=n()) %>% spread(gender, n) ## ------------------------------------------------------------------------ install.packages('ggplot2') library(ggplot2) +## ------------------------------------------------------------------------ +dat$wday <- factor(weekdays(dat$timestamp, abbreviate = TRUE), + levels = c('Mon','Tue','Wed','Thu','Fri','Sat','Sun') + ) +summary(dat$wday) + ## ------------------------------------------------------------------------ qplot(instructor.communicated, data = dat) qplot(wday, course.delivered, data = dat) diff --git a/instructor/day_three.Rmd b/instructor/day_three.Rmd index 2904e1a..ecaf1bd 100644 --- a/instructor/day_three.Rmd +++ b/instructor/day_three.Rmd @@ -47,21 +47,28 @@ summary(dat) table(dat$department) ``` -think back to day one - how would we make weekdays out of the date variable? 
+## the `psych` package provides trimmed means, skew, kurtosis, and missingness ```{r} -dat$wday <- factor(weekdays(dat$timestamp, abbreviate = TRUE), - levels = c('Mon','Tue','Wed','Thu','Fri','Sat','Sun') - ) -summary(dat$wday) +library(psych) +describe(dat) +``` + +## you can use dplyr::groupby to generate summaries + +```{r} +library(dplyr) +dat %>% group_by(gender) %>% summarize(n()) ``` -## reshape provides a few more ways to aggregate things +## and you can combine dplyr with tidyr::spread to generate crosstabs + +> side note - we are filtering out missing values of gender, because `tidyr` doesn't allow `NA` as a column name ```{r} -library(reshape2) -dcast(dat[dat$gender == 'Female/Woman' | dat$gender == 'Male/Man',], department ~ gender) -dcast(melt(dat, measure.vars = c('course.delivered')), wday ~ 'Delivered', fun.aggregate = mean) +library(tidyr) +dat %>% filter(!is.na(gender)) %>% group_by(gender, department) %>% + summarize(n=n()) %>% spread(gender, n) ``` # Plotting @@ -88,6 +95,19 @@ install.packages('ggplot2') library(ggplot2) ``` +## getting weekdays + +let's imagine that we are interested in looking at differences in feedback based on the day of the week -- how would we do this in R? + +> side note - `weekdays` is locale aware, so students who have their laptop language set to something other than english will get their weekday names in the other language + +```{r} +dat$wday <- factor(weekdays(dat$timestamp, abbreviate = TRUE), + levels = c('Mon','Tue','Wed','Thu','Fri','Sat','Sun') + ) +summary(dat$wday) +``` + ## use qplot for initial poking around it has very strong intuitions about what you want to see, and is not particularly customizable diff --git a/instructor/day_three.html b/instructor/day_three.html index 2c5f77d..8bfb8ce 100644 --- a/instructor/day_three.html +++ b/instructor/day_three.html @@ -54,7 +54,7 @@

Day Three: Data Analysis

Dillon Niederhut

-

02 May, 2016

+

24 May, 2016

@@ -94,133 +94,166 @@

let’s load in some data a

R provides two easy/simple summary functions in the base package

summary(dat)
##    timestamp          course.delivered instructor.communicated
-##  Min.   :2014-08-19   Min.   :1.000    Min.   :1.000          
-##  1st Qu.:2014-11-05   1st Qu.:6.000    1st Qu.:6.000          
-##  Median :2015-01-30   Median :7.000    Median :7.000          
-##  Mean   :2015-01-22   Mean   :6.251    Mean   :6.257          
-##  3rd Qu.:2015-04-03   3rd Qu.:7.000    3rd Qu.:7.000          
-##  Max.   :2015-06-22   Max.   :7.000    Max.   :7.000          
-##                                                               
-##                                      hear        interest  
-##  Email from the D-Lab mailing list     :340   Min.   :1.0  
-##  Found it on the D-Lab website         :278   1st Qu.:6.0  
-##  Heard about it from a friend/colleague:247   Median :7.0  
-##  Email from another mailing list       : 99   Mean   :6.6  
-##  Don't remember                        : 12   3rd Qu.:7.0  
-##  (Other)                               : 55   Max.   :7.0  
-##  NA's                                  : 31   NA's   :15   
-##                department     verbs               useful    
-##  Public Health      : 81   Length:1062        Min.   :1.00  
-##  Public Policy      : 44   Class :character   1st Qu.:5.00  
-##  Sociology          : 38   Mode  :character   Median :6.00  
-##  Political Science  : 36                      Mean   :6.02  
-##  Integrative Biology: 28                      3rd Qu.:7.00  
-##  (Other)            :288                      Max.   :7.00  
-##  NA's               :547                                    
-##                                gender     ethnicity        
-##  Female/Woman                     :579   Length:1062       
-##  Male/Man                         :332   Class :character  
-##  Genderqueer/Gender non-conforming:  1   Mode  :character  
-##  NA's                             :150                     
-##                                                            
-##                                                            
-##                                                            
-##  outside.barriers inside.barriers what.barriers     
-##  Min.   :1.000    Min.   :1.000   Length:1062       
-##  1st Qu.:1.000    1st Qu.:1.000   Class :character  
-##  Median :1.000    Median :1.000   Mode  :character  
-##  Mean   :2.073    Mean   :1.259                     
-##  3rd Qu.:3.000    3rd Qu.:1.000                     
-##  Max.   :5.000    Max.   :5.000                     
-##  NA's   :167      NA's   :175                       
-##                             position  
-##  PhD student, dissertation stage: 41  
-##  PhD student, pre-dissertation  : 33  
-##  Visiting fellow or researcher  : 24  
-##  Masters student                : 22  
-##  Undergraduate student          : 21  
-##  (Other)                        : 64  
+##  Min.   :2014-08-19   Min.   :1.000    Min.   :1.000
+##  1st Qu.:2014-11-05   1st Qu.:6.000    1st Qu.:6.000
+##  Median :2015-01-30   Median :7.000    Median :7.000
+##  Mean   :2015-01-22   Mean   :6.251    Mean   :6.257
+##  3rd Qu.:2015-04-03   3rd Qu.:7.000    3rd Qu.:7.000
+##  Max.   :2015-06-22   Max.   :7.000    Max.   :7.000
+##
+##                                      hear        interest
+##  Email from the D-Lab mailing list     :340   Min.   :1.0
+##  Found it on the D-Lab website         :278   1st Qu.:6.0
+##  Heard about it from a friend/colleague:247   Median :7.0
+##  Email from another mailing list       : 99   Mean   :6.6
+##  Don't remember                        : 12   3rd Qu.:7.0
+##  (Other)                               : 55   Max.   :7.0
+##  NA's                                  : 31   NA's   :15
+##                department     verbs               useful
+##  Public Health      : 81   Length:1062        Min.   :1.00
+##  Public Policy      : 44   Class :character   1st Qu.:5.00
+##  Sociology          : 38   Mode  :character   Median :6.00
+##  Political Science  : 36                      Mean   :6.02
+##  Integrative Biology: 28                      3rd Qu.:7.00
+##  (Other)            :288                      Max.   :7.00
+##  NA's               :547
+##                                gender     ethnicity
+##  Female/Woman                     :579   Length:1062
+##  Male/Man                         :332   Class :character
+##  Genderqueer/Gender non-conforming:  1   Mode  :character
+##  NA's                             :150
+##
+##
+##
+##  outside.barriers inside.barriers what.barriers
+##  Min.   :1.000    Min.   :1.000   Length:1062
+##  1st Qu.:1.000    1st Qu.:1.000   Class :character
+##  Median :1.000    Median :1.000   Mode  :character
+##  Mean   :2.073    Mean   :1.259
+##  3rd Qu.:3.000    3rd Qu.:1.000
+##  Max.   :5.000    Max.   :5.000
+##  NA's   :167      NA's   :175
+##                             position
+##  PhD student, dissertation stage: 41
+##  PhD student, pre-dissertation  : 33
+##  Visiting fellow or researcher  : 24
+##  Masters student                : 22
+##  Undergraduate student          : 21
+##  (Other)                        : 64
 ##  NA's                           :857
table(dat$department)
-
## 
-##  African American Studies  Ag & Resource Econ & Pol 
-##                        24                        23 
-##              Anthropology   App Sci & Tech Grad Grp 
-##                        12                        10 
-##    Biostatistics Grad Grp  City & Regional Planning 
-##                         8                        20 
-##                 Economics                 Education 
-##                        23                        26 
-##  Energy & Resources Group   Env Sci, Policy, & Mgmt 
-##                        14                        17 
-##   Ethnic Studies Grad Grp                   History 
-##                         1                        17 
-## Industrial Eng & Ops Rsch               Information 
-##                         4                         9 
-##       Integrative Biology              JSP Grad Pgm 
-##                        28                         6 
-##                       Law               Linguistics 
-##                         9                        11 
-##                     Music              Neuroscience 
-##                         3                         4 
-##         Political Science                Psychology 
-##                        36                        28 
-##             Public Health             Public Policy 
-##                        81                        44 
-##                  Rhetoric    Slavic Languages & Lit 
-##                        11                         8 
-##                 Sociology 
+
##
+##  African American Studies  Ag & Resource Econ & Pol
+##                        24                        23
+##              Anthropology   App Sci & Tech Grad Grp
+##                        12                        10
+##    Biostatistics Grad Grp  City & Regional Planning
+##                         8                        20
+##                 Economics                 Education
+##                        23                        26
+##  Energy & Resources Group   Env Sci, Policy, & Mgmt
+##                        14                        17
+##   Ethnic Studies Grad Grp                   History
+##                         1                        17
+## Industrial Eng & Ops Rsch               Information
+##                         4                         9
+##       Integrative Biology              JSP Grad Pgm
+##                        28                         6
+##                       Law               Linguistics
+##                         9                        11
+##                     Music              Neuroscience
+##                         3                         4
+##         Political Science                Psychology
+##                        36                        28
+##             Public Health             Public Policy
+##                        81                        44
+##                  Rhetoric    Slavic Languages & Lit
+##                        11                         8
+##                 Sociology
 ##                        38
-

think back to day one - how would we make weekdays out of the date variable?

-
dat$wday <- factor(weekdays(dat$timestamp, abbreviate = TRUE), 
-                   levels = c('Mon','Tue','Wed','Thu','Fri','Sat','Sun')
-                   )
-summary(dat$wday)
-
## Mon Tue Wed Thu Fri Sat Sun 
-## 168 124 144 323 277  16  10
-

the reshape2 package provides a few more ways to aggregate things

-
library(reshape2)
-dcast(dat[dat$gender == 'Female/Woman' | dat$gender == 'Male/Man',], department ~ gender)
-
## Using wday as value column: use value.var to override.
-## Aggregation function missing: defaulting to length
-
##                   department Female/Woman Male/Man  NA
-## 1   African American Studies            8       16   0
-## 2   Ag & Resource Econ & Pol           20        3   0
-## 3               Anthropology            9        3   0
-## 4    App Sci & Tech Grad Grp            6        4   0
-## 5     Biostatistics Grad Grp            5        3   0
-## 6   City & Regional Planning           12        7   0
-## 7                  Economics           16        5   0
-## 8                  Education           20        3   0
-## 9   Energy & Resources Group           10        3   0
-## 10   Env Sci, Policy, & Mgmt           11        5   0
-## 11   Ethnic Studies Grad Grp            1        0   0
-## 12                   History            9        6   0
-## 13 Industrial Eng & Ops Rsch            2        2   0
-## 14               Information            2        7   0
-## 15       Integrative Biology           20        8   0
-## 16              JSP Grad Pgm            5        1   0
-## 17                       Law            5        4   0
-## 18               Linguistics            8        1   0
-## 19                     Music            2        0   0
-## 20              Neuroscience            0        4   0
-## 21         Political Science           17       18   0
-## 22                Psychology           20        8   0
-## 23             Public Health           55       19   0
-## 24             Public Policy           22       21   0
-## 25                  Rhetoric            0       11   0
-## 26    Slavic Languages & Lit            7        1   0
-## 27                 Sociology           23       12   0
-## 28                      <NA>          264      157 150
-
dcast(melt(dat, measure.vars = c('course.delivered')), wday ~ 'Delivered', fun.aggregate = mean)
-
##   wday Delivered
-## 1  Mon  6.309524
-## 2  Tue  6.274194
-## 3  Wed  6.159722
-## 4  Thu  6.077399
-## 5  Fri  6.444043
-## 6  Sat  6.250000
-## 7  Sun  6.600000
+

the psych package provides trimmed means, skew, kurtosis, and missingness

+
library(psych)
+describe(dat)
+
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
+## Inf
+
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
+## Inf
+
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
+## Inf
+
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
+## Inf
+
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
+## -Inf
+
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
+## -Inf
+
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
+## -Inf
+
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
+## -Inf
+
##                         vars    n  mean   sd median trimmed  mad min  max
+## timestamp*                 1 1062   NaN   NA     NA     NaN   NA Inf -Inf
+## course.delivered           2 1062  6.25 1.11      7    6.47 0.00   1    7
+## instructor.communicated    3 1062  6.26 1.08      7    6.47 0.00   1    7
+## hear*                      4 1031 23.08 6.77     24   23.10 7.41   1   51
+## interest                   5 1047  6.60 0.80      7    6.79 0.00   1    7
+## department*                6  515 15.86 8.45     18   16.29 8.90   1   27
+## verbs*                     7  383   NaN   NA     NA     NaN   NA Inf -Inf
+## useful                     8 1062  6.02 1.20      6    6.23 1.48   1    7
+## gender*                    9  912  1.37 0.48      1    1.33 0.00   1    3
+## ethnicity*                10 1062   NaN   NA     NA     NaN   NA Inf -Inf
+## outside.barriers          11  895  2.07 1.29      1    1.89 0.00   1    5
+## inside.barriers           12  887  1.26 0.68      1    1.07 0.00   1    5
+## what.barriers*            13  120   NaN   NA     NA     NaN   NA Inf -Inf
+## position*                 14  205 13.14 5.84     14   13.70 2.97   1   23
+##                         range  skew kurtosis   se
+## timestamp*               -Inf    NA       NA   NA
+## course.delivered            6 -1.92     4.20 0.03
+## instructor.communicated     6 -1.92     4.35 0.03
+## hear*                      50  0.44     0.88 0.21
+## interest                    6 -2.84    11.11 0.02
+## department*                26 -0.37    -1.34 0.37
+## verbs*                   -Inf    NA       NA   NA
+## useful                      6 -1.57     2.89 0.04
+## gender*                     2  0.58    -1.58 0.02
+## ethnicity*               -Inf    NA       NA   NA
+## outside.barriers            4  0.87    -0.53 0.04
+## inside.barriers             4  2.93     8.62 0.02
+## what.barriers*           -Inf    NA       NA   NA
+## position*                  22 -0.85    -0.27 0.41
+

you can use dplyr::group_by to generate summaries

+
library(dplyr)
+dat %>% group_by(gender) %>% summarize(n())
+
## Source: local data frame [4 x 2]
+##
+##                              gender   n()
+##                              (fctr) (int)
+## 1                      Female/Woman   579
+## 2                          Male/Man   332
+## 3 Genderqueer/Gender non-conforming     1
+## 4                                NA   150
+

and you can combine dplyr with tidyr::spread to generate crosstabs

+
+

side note - we are filtering out missing values of gender, because tidyr doesn’t allow NA as a column name

+
+
library(tidyr)
+dat %>% filter(!is.na(gender)) %>% group_by(gender, department) %>% 
+  summarize(n=n()) %>% spread(gender, n)
+
## Source: local data frame [28 x 4]
+##
+##                  department Female/Woman Male/Man
+##                      (fctr)        (int)    (int)
+## 1  African American Studies            8       16
+## 2  Ag & Resource Econ & Pol           20        3
+## 3              Anthropology            9        3
+## 4   App Sci & Tech Grad Grp            6        4
+## 5    Biostatistics Grad Grp            5        3
+## 6  City & Regional Planning           12        7
+## 7                 Economics           16        5
+## 8                 Education           20        3
+## 9  Energy & Resources Group           10        3
+## 10  Env Sci, Policy, & Mgmt           11        5
+## ..                      ...          ...      ...
+## Variables not shown: Genderqueer/Gender non-conforming (int)
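if you would rather have the missing responses show up as their own column, base R's `addNA` keeps `NA` as an explicit factor level - a minimal sketch on a made-up toy factor:

```r
# addNA() promotes NA to a real factor level, so downstream tools can
# treat "missing" as just another category
g <- factor(c("F", "M", NA, "F"))
levels(g)         # "F" "M" - the NA is invisible
levels(addNA(g))  # "F" "M" NA - now it's an explicit level
table(addNA(g))   # counts now include a cell for the missing responses
```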

Plotting

@@ -239,10 +272,21 @@

install.packages('ggplot2')

-
## 
+
##
 ## The downloaded binary packages are in
-##  /var/folders/rj/8gpcssqd52z9yrqw7f8xxfym0000gn/T//Rtmp2xjYZ7/downloaded_packages
+## /var/folders/rj/8gpcssqd52z9yrqw7f8xxfym0000gn/T//RtmpmP1txl/downloaded_packages
library(ggplot2)
+

getting weekdays

+

let’s imagine that we are interested in looking at differences in feedback based on the day of the week – how would we do this in R?

+
+

side note - `weekdays` is locale-aware, so students who have their laptop language set to something other than English will get their weekday names in that language

+
+
dat$wday <- factor(weekdays(dat$timestamp, abbreviate = TRUE),
+                   levels = c('Mon','Tue','Wed','Thu','Fri','Sat','Sun')
+                   )
+summary(dat$wday)
+
## Mon Tue Wed Thu Fri Sat Sun
+## 168 124 144 323 277  16  10
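if you want weekday labels that don't depend on the student's locale, one workaround is to go through the ISO weekday number via `format`'s `%u` code and attach your own labels (the dates below are made up for illustration):

```r
# %u gives the ISO weekday as a number (1 = Monday), independent of locale;
# we then index into our own vector of English labels
dates <- as.Date(c("2015-01-26", "2015-01-27"))   # a Monday and a Tuesday
wday_num <- as.integer(format(dates, "%u"))
wday_lab <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")[wday_num]
wday_lab  # "Mon" "Tue"
```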

use qplot for initial poking around

it has very strong intuitions about what you want to see, and is not particularly customizable

qplot(instructor.communicated, data = dat)
@@ -323,45 +367,45 @@

Mean testing

we’ll start by trying to tell whether differences between group summaries are real

t.test with two vectors (default method)

t.test(dat$inside.barriers, dat$outside.barriers)
-
## 
+
##
 ##  Welch Two Sample t-test
-## 
+##
 ## data:  dat$inside.barriers and dat$outside.barriers
 ## t = -16.638, df = 1356.8, p-value < 2.2e-16
 ## alternative hypothesis: true difference in means is not equal to 0
 ## 95 percent confidence interval:
 ##  -0.9092224 -0.7174269
 ## sample estimates:
-## mean of x mean of y 
+## mean of x mean of y
 ##  1.259301  2.072626

note that R takes care of the defaults for you - what it is really computing is `t.test(dat$inside.barriers, dat$outside.barriers, alternative = "two.sided", paired = FALSE, var.equal = FALSE, mu = 0, conf.level = 0.95)`

how would you find this out for yourself?
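one way: read the help page with `?t.test`, or inspect the default method's formal arguments directly - a sketch (note that `stats:::t.test.default` reaches into the stats namespace, since the generic itself only declares `x` and `...`):

```r
# the generic t.test() just dispatches; the defaults live on the default method
formals(stats:::t.test.default)$mu          # 0
formals(stats:::t.test.default)$paired      # FALSE
formals(stats:::t.test.default)$conf.level  # 0.95
```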

t.test with subsets of one vector (default method)

t.test(dat$outside.barriers[dat$gender == "Male/Man"], dat$outside.barriers[dat$gender == "Female/Woman"])
-
## 
+
##
 ##  Welch Two Sample t-test
-## 
+##
 ## data:  dat$outside.barriers[dat$gender == "Male/Man"] and dat$outside.barriers[dat$gender == "Female/Woman"]
 ## t = -6.9925, df = 748.19, p-value = 5.993e-12
 ## alternative hypothesis: true difference in means is not equal to 0
 ## 95 percent confidence interval:
 ##  -0.7650033 -0.4296142
 ## sample estimates:
-## mean of x mean of y 
+## mean of x mean of y
 ##  1.702875  2.300184

recall that we mentioned inconsistency on day one - here it is, and in a big way

t.test with S3 method

t.test(outside.barriers ~ gender, data = dat, subset = dat$gender %in% c("Male/Man", "Female/Woman"))
-
## 
+
##
 ##  Welch Two Sample t-test
-## 
+##
 ## data:  outside.barriers by gender
 ## t = 6.9925, df = 748.19, p-value = 5.993e-12
 ## alternative hypothesis: true difference in means is not equal to 0
 ## 95 percent confidence interval:
 ##  0.4296142 0.7650033
 ## sample estimates:
-## mean in group Female/Woman     mean in group Male/Man 
+## mean in group Female/Woman     mean in group Male/Man
 ##                   2.300184                   1.702875

aov

first, you would think anova would be called by anova, but that’s reserved for conducting F-tests on lm objects
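to see what `anova` actually does, fit an `lm` and hand it the fitted object - a minimal sketch on the built-in `mtcars` data:

```r
# anova() runs F-tests on a fitted model object, not on raw data
fit <- lm(mpg ~ factor(cyl), data = mtcars)
anova(fit)  # one row per term, with Df, Sum Sq, F value, and Pr(>F)
```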

@@ -372,12 +416,12 @@

aov

aov(outside.barriers ~ gender, data = dat)
## Call:
 ##    aov(formula = outside.barriers ~ gender, data = dat)
-## 
+##
 ## Terms:
 ##                    gender Residuals
 ## Sum of Squares    79.3444 1363.4374
 ## Deg. of Freedom         2       854
-## 
+##
 ## Residual standard error: 1.263539
 ## Estimated effects may be unbalanced
 ## 205 observations deleted due to missingness
@@ -385,9 +429,9 @@

aov

remember our old friend summary? it works on almost everything

model.1 <- aov(outside.barriers ~ gender, data = dat)
 summary(model.1)
-
##              Df Sum Sq Mean Sq F value   Pr(>F)    
+
##              Df Sum Sq Mean Sq F value   Pr(>F)
 ## gender        2   79.3   39.67   24.85 3.24e-11 ***
-## Residuals   854 1363.4    1.60                     
+## Residuals   854 1363.4    1.60
 ## ---
 ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
 ## 205 observations deleted due to missingness
@@ -395,9 +439,9 @@

aov

TukeyHSD(model.1)
##   Tukey multiple comparisons of means
 ##     95% family-wise confidence level
-## 
+##
 ## Fit: aov(formula = outside.barriers ~ gender, data = dat)
-## 
+##
 ## $gender
 ##                                                      diff        lwr
 ## Male/Man-Female/Woman                          -0.5973088 -0.8078392
@@ -418,16 +462,16 @@ 

cor.test (Pearson)

earlier, we were looking at differences between the means of two variables

but those variables were both continuous, so we can ask whether they are related

cor.test(dat$outside.barriers, dat$inside.barriers)
-
## 
+
##
 ##  Pearson's product-moment correlation
-## 
+##
 ## data:  dat$outside.barriers and dat$inside.barriers
 ## t = 15.558, df = 882, p-value < 2.2e-16
 ## alternative hypothesis: true correlation is not equal to 0
 ## 95 percent confidence interval:
 ##  0.4106679 0.5142422
 ## sample estimates:
-##       cor 
+##       cor
 ## 0.4640396

okay, so they’re related - now what?

lm

@@ -436,37 +480,37 @@

lm

the basic call is the S3 method

model.1 <- lm(inside.barriers ~ outside.barriers, data = dat)
 summary(model.1)
-
## 
+
##
 ## Call:
 ## lm(formula = inside.barriers ~ outside.barriers, data = dat)
-## 
+##
 ## Residuals:
-##      Min       1Q   Median       3Q      Max 
-## -0.98483 -0.24569  0.00069  0.00069  3.01517 
-## 
+##      Min       1Q   Median       3Q      Max
+## -0.98483 -0.24569  0.00069  0.00069  3.01517
+##
 ## Coefficients:
-##                  Estimate Std. Error t value Pr(>|t|)    
+##                  Estimate Std. Error t value Pr(>|t|)
 ## (Intercept)       0.75292    0.03842   19.60   <2e-16 ***
 ## outside.barriers  0.24638    0.01584   15.56   <2e-16 ***
 ## ---
 ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-## 
+##
 ## Residual standard error: 0.6041 on 882 degrees of freedom
 ##   (178 observations deleted due to missingness)
-## Multiple R-squared:  0.2153, Adjusted R-squared:  0.2144 
+## Multiple R-squared:  0.2153, Adjusted R-squared:  0.2144
 ## F-statistic:   242 on 1 and 882 DF,  p-value: < 2.2e-16

R automatically one-hot encodes your categories

model.2 <- lm(inside.barriers ~ outside.barriers + department, data = dat)
 summary(model.2)
-
## 
+
##
 ## Call:
-## lm(formula = inside.barriers ~ outside.barriers + department, 
+## lm(formula = inside.barriers ~ outside.barriers + department,
 ##     data = dat)
-## 
+##
 ## Residuals:
-##      Min       1Q   Median       3Q      Max 
-## -1.20049 -0.36011 -0.04989  0.17705  2.91702 
-## 
+##      Min       1Q   Median       3Q      Max
+## -1.20049 -0.36011 -0.04989  0.17705  2.91702
+##
 ## Coefficients:
 ##                                     Estimate Std. Error t value Pr(>|t|)
 ## (Intercept)                          0.91782    0.14467   6.344 5.57e-10
@@ -497,54 +541,54 @@ 

R automatically one-hot encodes your categories

## departmentRhetoric                    0.17521    0.24153   0.725   0.4686
## departmentSlavic Languages & Lit     -0.19495    0.26748  -0.729   0.4665
## departmentSociology                  -0.34162    0.17664  -1.934   0.0537
##
## (Intercept)                          ***
## outside.barriers                     ***
## departmentAg & Resource Econ & Pol   *
## departmentAnthropology
## departmentApp Sci & Tech Grad Grp
## departmentBiostatistics Grad Grp
## departmentCity & Regional Planning
## departmentEconomics                  .
## departmentEducation
## departmentEnergy & Resources Group   .
## departmentEnv Sci, Policy, & Mgmt
## departmentEthnic Studies Grad Grp
## departmentHistory
## departmentIndustrial Eng & Ops Rsch
## departmentInformation
## departmentIntegrative Biology        .
## departmentJSP Grad Pgm
## departmentLaw
## departmentLinguistics
## departmentMusic
## departmentNeuroscience
## departmentPolitical Science
## departmentPsychology
## departmentPublic Health              *
## departmentPublic Policy
## departmentRhetoric
## departmentSlavic Languages & Lit
## departmentSociology                  .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6462 on 440 degrees of freedom
##   (594 observations deleted due to missingness)
## Multiple R-squared:  0.2759, Adjusted R-squared:  0.2314
## F-statistic: 6.209 on 27 and 440 DF,  p-value: < 2.2e-16
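you can see the coding R will use with `contrasts` and `model.matrix` - a sketch on a made-up toy factor (with the default treatment contrasts, the first level becomes the reference):

```r
f <- factor(c("a", "b", "c", "a"))
contrasts(f)             # indicator columns for "b" and "c"; "a" is the reference
head(model.matrix(~ f))  # the design matrix lm() builds: intercept + those columns
```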

R does not assume you want the full factorial model

model.3 <- lm(inside.barriers ~ outside.barriers + department + outside.barriers*department, data = dat)
 summary(model.3)
-
## 
+
##
 ## Call:
-## lm(formula = inside.barriers ~ outside.barriers + department + 
+## lm(formula = inside.barriers ~ outside.barriers + department +
 ##     outside.barriers * department, data = dat)
-## 
+##
 ## Residuals:
-##      Min       1Q   Median       3Q      Max 
-## -1.75495 -0.25924  0.00000  0.05784  2.80608 
-## 
+##      Min       1Q   Median       3Q      Max
+## -1.75495 -0.25924  0.00000  0.05784  2.80608
+##
 ## Coefficients: (3 not defined because of singularities)
 ##                                                        Estimate Std. Error
 ## (Intercept)                                           0.3378995  0.2274560
@@ -601,71 +645,71 @@ 

R does not assume you want the full factorial model

## outside.barriers:departmentRhetoric                   2.1457382  0.4109273
## outside.barriers:departmentSlavic Languages & Lit            NA         NA
## outside.barriers:departmentSociology                 -0.4996106  0.1372998
##                                                      t value Pr(>|t|)
## (Intercept)                                            1.486 0.138151
## outside.barriers                                       5.636 3.22e-08 ***
## departmentAg & Resource Econ & Pol                     1.634 0.102964
## departmentAnthropology                                 0.234 0.814794
## departmentApp Sci & Tech Grad Grp                      0.000 0.999802
## departmentBiostatistics Grad Grp                      -1.350 0.177895
## departmentCity & Regional Planning                     0.836 0.403848
## departmentEconomics                                    2.013 0.044719 *
## departmentEducation                                    0.359 0.719428
## departmentEnergy & Resources Group                     1.609 0.108295
## departmentEnv Sci, Policy, & Mgmt                      0.379 0.704979
## departmentEthnic Studies Grad Grp                     -0.911 0.362671
## departmentHistory                                     -0.480 0.631266
## departmentIndustrial Eng & Ops Rsch                    0.405 0.685368
## departmentInformation                                  0.532 0.595352
## departmentIntegrative Biology                          1.694 0.091057 .
## departmentJSP Grad Pgm                                -0.595 0.552372
## departmentLaw                                          1.712 0.087560 .
## departmentLinguistics                                  1.932 0.054032 .
## departmentMusic                                       -1.261 0.208134
## departmentNeuroscience                                 0.717 0.473811
## departmentPolitical Science                            1.203 0.229669
## departmentPsychology                                   2.158 0.031503 *
## departmentPublic Health                                1.668 0.096018 .
## departmentPublic Policy                                1.009 0.313790
## departmentRhetoric                                    -5.518 6.03e-08 ***
## departmentSlavic Languages & Lit                       0.226 0.821167
## departmentSociology                                    2.034 0.042571 *
## outside.barriers:departmentAg & Resource Econ & Pol   -3.649 0.000297 ***
## outside.barriers:departmentAnthropology               -1.118 0.264358
## outside.barriers:departmentApp Sci & Tech Grad Grp     0.006 0.995116
## outside.barriers:departmentBiostatistics Grad Grp      1.303 0.193236
## outside.barriers:departmentCity & Regional Planning   -1.007 0.314480
## outside.barriers:departmentEconomics                  -3.395 0.000752 ***
## outside.barriers:departmentEducation                  -1.409 0.159722
## outside.barriers:departmentEnergy & Resources Group   -3.251 0.001242 **
## outside.barriers:departmentEnv Sci, Policy, & Mgmt    -0.852 0.394793
## outside.barriers:departmentEthnic Studies Grad Grp        NA       NA
## outside.barriers:departmentHistory                     1.036 0.300571
## outside.barriers:departmentIndustrial Eng & Ops Rsch  -1.034 0.301967
## outside.barriers:departmentInformation                -1.137 0.256041
## outside.barriers:departmentIntegrative Biology        -3.273 0.001154 **
## outside.barriers:departmentJSP Grad Pgm                0.826 0.409544
## outside.barriers:departmentLaw                        -3.329 0.000950 ***
## outside.barriers:departmentLinguistics                -3.011 0.002758 **
## outside.barriers:departmentMusic                          NA       NA
## outside.barriers:departmentNeuroscience               -0.882 0.378243
## outside.barriers:departmentPolitical Science          -2.181 0.029771 *
## outside.barriers:departmentPsychology                 -3.046 0.002465 **
## outside.barriers:departmentPublic Health              -3.660 0.000284 ***
## outside.barriers:departmentPublic Policy              -1.996 0.046612 *
## outside.barriers:departmentRhetoric                    5.222 2.80e-07 ***
## outside.barriers:departmentSlavic Languages & Lit         NA       NA
## outside.barriers:departmentSociology                  -3.639 0.000308 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.586 on 417 degrees of freedom
##   (594 observations deleted due to missingness)
## Multiple R-squared:  0.4357, Adjusted R-squared:  0.368
## F-statistic: 6.439 on 50 and 417 DF,  p-value: < 2.2e-16

extract model parameters with $

model.1$coefficients
-
##      (Intercept) outside.barriers 
+
##      (Intercept) outside.barriers
 ##        0.7529250        0.2463815
model.1$coefficients[[2]]
## [1] 0.2463815
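the extractor functions `coef` and `confint` do the same job and are a bit more robust than reaching in with `$` - a sketch on the built-in `cars` data:

```r
fit <- lm(dist ~ speed, data = cars)
coef(fit)["speed"]     # the slope estimate
confint(fit, "speed")  # its 95 percent confidence interval
```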
@@ -696,9 +740,9 @@

ranked variables

median testing ranks

we use the Mann-Whitney (Wilcoxon rank-sum) test to ask whether the ranks are centered the same way

wilcox.test(dat$outside.barriers, dat$inside.barriers, alternative = "two.sided", paired = FALSE, mu = 0, conf.level = 0.95)
-
## 
+
##
 ##  Wilcoxon rank sum test with continuity correction
-## 
+##
 ## data:  dat$outside.barriers and dat$inside.barriers
 ## W = 541240, p-value < 2.2e-16
 ## alternative hypothesis: true location shift is not equal to 0
@@ -708,14 +752,14 @@

correlating ranks

cor.test(dat$outside.barriers, dat$inside.barriers, method = 'spearman')
## Warning in cor.test.default(dat$outside.barriers, dat$inside.barriers,
 ## method = "spearman"): Cannot compute exact p-value with ties
-
## 
+
##
 ##  Spearman's rank correlation rho
-## 
+##
 ## data:  dat$outside.barriers and dat$inside.barriers
 ## S = 63037000, p-value < 2.2e-16
 ## alternative hypothesis: true rho is not equal to 0
 ## sample estimates:
-##       rho 
+##       rho
 ## 0.4524909

rho is pretty close to the r from above

chisq

@@ -724,9 +768,9 @@

chisq

chisq.test(dat$gender, dat$department)
## Warning in chisq.test(dat$gender, dat$department): Chi-squared
 ## approximation may be incorrect
-
## 
+
##
 ##  Pearson's Chi-squared test
-## 
+##
 ## data:  dat$gender and dat$department
 ## X-squared = 76.442, df = 26, p-value = 7.326e-07

diff --git a/instructor/day_two.html b/instructor/day_two.html index adca61f..5a8d1bb 100644 --- a/instructor/day_two.html +++ b/instructor/day_two.html @@ -55,7 +55,7 @@

Day Two: Data Cleaning

Dillon Niederhut
Shinhye Choi

-

02 May, 2016

+

24 May, 2016

Review

@@ -66,28 +66,28 @@

Inspecting objects

Inspecting variables

We should see 50 values in this division variable - but how many levels?

state.division
-
##  [1] East South Central Pacific            Mountain          
-##  [4] West South Central Pacific            Mountain          
-##  [7] New England        South Atlantic     South Atlantic    
-## [10] South Atlantic     Pacific            Mountain          
+
##  [1] East South Central Pacific            Mountain
+##  [4] West South Central Pacific            Mountain
+##  [7] New England        South Atlantic     South Atlantic
+## [10] South Atlantic     Pacific            Mountain
 ## [13] East North Central East North Central West North Central
 ## [16] West North Central East South Central West South Central
-## [19] New England        South Atlantic     New England       
+## [19] New England        South Atlantic     New England
 ## [22] East North Central West North Central East South Central
 ## [25] West North Central Mountain           West North Central
-## [28] Mountain           New England        Middle Atlantic   
-## [31] Mountain           Middle Atlantic    South Atlantic    
+## [28] Mountain           New England        Middle Atlantic
+## [31] Mountain           Middle Atlantic    South Atlantic
 ## [34] West North Central East North Central West South Central
-## [37] Pacific            Middle Atlantic    New England       
+## [37] Pacific            Middle Atlantic    New England
 ## [40] South Atlantic     West North Central East South Central
-## [43] West South Central Mountain           New England       
-## [46] South Atlantic     Pacific            South Atlantic    
-## [49] East North Central Mountain          
+## [43] West South Central Mountain           New England
+## [46] South Atlantic     Pacific            South Atlantic
+## [49] East North Central Mountain
 ## 9 Levels: New England Middle Atlantic ... Pacific
length(state.division)
## [1] 50
levels(state.division)
-
## [1] "New England"        "Middle Atlantic"    "South Atlantic"    
+
## [1] "New England"        "Middle Atlantic"    "South Atlantic"
 ## [4] "East South Central" "West South Central" "East North Central"
 ## [7] "West North Central" "Mountain"           "Pacific"

Inspecting data frames

@@ -190,7 +190,7 @@

str(dirty)

## 'data.frame':    5 obs. of  5 variables:
 ##  $ Timestamp                  : Factor w/ 5 levels "7/25/2015 10:08:41",..: 1 2 3 4 5
-##  $ How.tall.are.you.          : Factor w/ 5 levels "156","2.1","5’9",..: 5 4 3 2 1
+##  $ How.tall.are.you.          : Factor w/ 5 levels "156","2.1","5'9",..: 5 4 3 2 1
 ##  $ What.department.are.you.in.: Factor w/ 5 levels "  geology","999",..: 4 2 1 5 3
 ##  $ Are.you.currently.enrolled.: Factor w/ 3 levels "999","No","Yes": 3 3 1 2 1
 ##  $ What.is.your.birth.order.  : Factor w/ 3 levels "1","2","9,000": 1 1 2 3 2
@@ -200,16 +200,16 @@

str(dirty)

## 'data.frame':    5 obs. of  5 variables:
 ##  $ Timestamp                  : chr  "7/25/2015 10:08:41" "7/25/2015 10:10:56" "7/25/2015 10:11:20" "7/25/2015 10:11:25" ...
-##  $ How.tall.are.you.          : chr  "very" "70" "5’9" "2.1" ...
+##  $ How.tall.are.you.          : chr  "very" "70" "5'9" "2.1" ...
 ##  $ What.department.are.you.in.: chr  "Geology  " "999" "  geology" "goelogy" ...
 ##  $ Are.you.currently.enrolled.: chr  "Yes" "Yes" "999" "No" ...
 ##  $ What.is.your.birth.order.  : chr  "1" "1" "2" "9,000" ...

let’s start by removing the empty rows and columns

tail(dirty)
##            Timestamp How.tall.are.you. What.department.are.you.in.
-## 1 7/25/2015 10:08:41              very                   Geology  
+## 1 7/25/2015 10:08:41              very                   Geology
 ## 2 7/25/2015 10:10:56                70                         999
-## 3 7/25/2015 10:11:20               5’9                     geology
+## 3 7/25/2015 10:11:20               5'9                     geology
 ## 4 7/25/2015 10:11:25               2.1                     goelogy
 ## 5 7/25/2015 10:11:29               156                      anthro
 ##   Are.you.currently.enrolled. What.is.your.birth.order.
@@ -224,7 +224,7 @@ 


you can replace variable names

and you should, if they are uninformative or long

names(dirty)
-
## [1] "Timestamp"                   "How.tall.are.you."          
+
## [1] "Timestamp"                   "How.tall.are.you."
 ## [3] "What.department.are.you.in." "Are.you.currently.enrolled."
 ## [5] "What.is.your.birth.order."
names(dirty) <- c("time", "height", "dept", "enroll", "birth.order")
@@ -234,13 +234,13 @@

you should replace all of these values in your dataframe with R’s missingness signifier, NA

table(dirty$enroll)
-
## 
-## 999  No Yes 
+
##
+## 999  No Yes
 ##   2   1   2
dirty$enroll[dirty$enroll=="999"] <- NA
 table(dirty$enroll, useNA = "ifany")
-
## 
-##   No  Yes <NA> 
+
##
+##   No  Yes <NA>
 ##    1    2    2

side note - read.table() has an option to specify field values as NA as soon as you import the data, but this is a BAAAAD idea because R automatically encodes blank fields as missing too, and thus you lose the ability to distinguish between user-missing and experimenter-missing
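a quick sketch of why this bites (toy data; the column names are invented for illustration):

```r
# na.strings recodes the sentinel "999" at import time, but an empty
# numeric field is *also* read in as NA -- so user-missing (999) and
# experimenter-missing (blank) collapse into the same code
raw <- "height,dept\n70,geology\n999,anthro\n,history\n65,geology"
d <- read.csv(text = raw, na.strings = "999")
is.na(d$height)
## [1] FALSE  TRUE  TRUE FALSE
```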

@@ -310,27 +310,27 @@

remember how we tal

let’s use this large dataset as an example

large <- read.csv('data/large.csv')
 summary(large)
-
##        a                   b               c             
-##  Min.   :-33.98426   Min.   :-13.4   Min.   :-249998.64  
-##  1st Qu.: -6.71903   1st Qu.:128.6   1st Qu.:-141005.65  
-##  Median :  0.41681   Median :256.9   Median : -63498.56  
-##  Mean   :  0.00176   Mean   :252.2   Mean   : -83954.09  
-##  3rd Qu.:  7.00630   3rd Qu.:377.5   3rd Qu.: -15748.98  
-##  Max.   : 35.33306   Max.   :513.3   Max.   :     11.77  
+
##        a                   b               c
+##  Min.   :-33.98426   Min.   :-13.4   Min.   :-249998.64
+##  1st Qu.: -6.71903   1st Qu.:128.6   1st Qu.:-141005.65
+##  Median :  0.41681   Median :256.9   Median : -63498.56
+##  Mean   :  0.00176   Mean   :252.2   Mean   : -83954.09
+##  3rd Qu.:  7.00630   3rd Qu.:377.5   3rd Qu.: -15748.98
+##  Max.   : 35.33306   Max.   :513.3   Max.   :     11.77
 ##  NA's   :45          NA's   :45      NA's   :45
nrow(na.omit(large))
## [1] 871

for it to work you need low missingness and large N

a <- amelia(large,m = 1)
## -- Imputation 1 --
-## 
+##
 ##   1  2  3
print(a)
-
## 
+
##
 ## Amelia output with 1 imputed datasets.
-## Return code:  1 
-## Message:  Normal EM convergence. 
-## 
+## Return code:  1
+## Message:  Normal EM convergence.
+##
 ## Chain Lengths:
 ## --------------
 ## Imputation 1:  3
@@ -338,12 +338,12 @@

large.imputed <- a[[1]][[1]]
 summary(large.imputed)
-
##        a                   b               c          
-##  Min.   :-33.98426   Min.   :-13.4   Min.   :-249999  
-##  1st Qu.: -6.73649   1st Qu.:126.5   1st Qu.:-140641  
-##  Median :  0.30970   Median :252.0   Median : -63513  
-##  Mean   : -0.01213   Mean   :250.0   Mean   : -83156  
-##  3rd Qu.:  6.99412   3rd Qu.:373.9   3rd Qu.: -15561  
+
##        a                   b               c
+##  Min.   :-33.98426   Min.   :-13.4   Min.   :-249999
+##  1st Qu.: -6.73649   1st Qu.:126.5   1st Qu.:-140641
+##  Median :  0.30970   Median :252.0   Median : -63513
+##  Mean   : -0.01213   Mean   :250.0   Mean   : -83156
+##  3rd Qu.:  6.99412   3rd Qu.:373.9   3rd Qu.: -15561
 ##  Max.   : 35.33306   Max.   :518.7   Max.   :  69498

if you give it a tiny dataset, it will fuss at you

a <- amelia(large[990:1000,],m = 1)
@@ -352,14 +352,14 @@

## variables in the imputation model. Consider removing some variables, or
## reducing the order of time polynomials to reduce the number of parameters.

## -- Imputation 1 --
-## 
+##
 ##   1  2
print(a)
-
## 
+
##
 ## Amelia output with 1 imputed datasets.
-## Return code:  1 
-## Message:  Normal EM convergence. 
-## 
+## Return code:  1
+## Message:  Normal EM convergence.
+##
 ## Chain Lengths:
 ## --------------
 ## Imputation 1:  2
@@ -404,10 +404,10 @@

subsetting data frames

my.data$numeric == 2
## logical(0)
my.data[my.data$numeric == 2,]
-
## [1] n                                        
-## [2] c                                        
-## [3] b                                        
-## [4] d                                        
+
## [1] n
+## [2] c
+## [3] b
+## [4] d
 ## [5] really.long.and.complicated.variable.name
 ## <0 rows> (or 0-length row.names)

boolean variables can act as filters right out of the box
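a minimal sketch (invented toy data frame):

```r
# a logical column subsets rows directly -- no comparison needed
df <- data.frame(x = 1:4, keep = c(TRUE, FALSE, TRUE, FALSE))
df[df$keep, ]   # rows 1 and 3; df[df$keep == TRUE, ] is equivalent but redundant
```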

@@ -423,19 +423,19 @@

you can also select columns

you can also match elements from a vector

good.things <- c("three", "four", "five")
 my.data[my.data$character %in% good.things, ]
-
## [1] n                                        
-## [2] c                                        
-## [3] b                                        
-## [4] d                                        
+
## [1] n
+## [2] c
+## [3] b
+## [4] d
 ## [5] really.long.and.complicated.variable.name
 ## <0 rows> (or 0-length row.names)

most subsetting operations on dataframes also return a dataframe

str(my.data[!(my.data$character %in% good.things), ])
## 'data.frame':    0 obs. of  5 variables:
-##  $ n                                        : num 
-##  $ c                                        : Factor w/ 3 levels "one","three",..: 
-##  $ b                                        : logi 
-##  $ d                                        :Class 'Date'  num(0) 
+##  $ n                                        : num
+##  $ c                                        : Factor w/ 3 levels "one","three",..:
+##  $ b                                        : logi
+##  $ d                                        :Class 'Date'  num(0)
 ##  $ really.long.and.complicated.variable.name: num

subsets that are a single column return a vector

str(my.data$numeric)
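the drop-to-vector behaviour can be sketched like this (toy data frame, names invented):

```r
df <- data.frame(a = 1:3, b = c("one", "two", "three"))
class(df[, "a"])                 # "integer" -- a single column drops to a vector
class(df[, "a", drop = FALSE])   # "data.frame" -- drop = FALSE keeps the frame
class(df["a"])                   # list-style indexing also returns a data frame
```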
@@ -484,16 +484,16 @@

reshaping

side note - don’t worry about how this works yet - we’ll talk about it tomorrow

t.test(score ~ time, data=normal)
-
## 
+
##
 ##  Welch Two Sample t-test
-## 
+##
 ## data:  score by time
 ## t = 0.58132, df = 2.0278, p-value = 0.6191
 ## alternative hypothesis: true difference in means is not equal to 0
 ## 95 percent confidence interval:
 ##  -73.56101  96.89434
 ## sample estimates:
-## mean in group 1 mean in group 2 
+## mean in group 1 mean in group 2
 ##       110.00000        98.33333

it’s easy to combine tidy tables to compare different levels of information simultaneously
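for example, with base R's merge() (toy tables; the names are invented for illustration):

```r
# two tidy tables keyed on the same id column
scores <- data.frame(id = 1:3, score = c(90, 85, 100))
depts  <- data.frame(id = 1:3, dept = c("geology", "anthro", "history"))
merge(scores, depts, by = "id")   # one row per id, both tables' columns
```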

@@ -608,7 +608,7 @@

dplyr allows you to apply
group_by(normal, time)
## Source: local data frame [6 x 4]
 ## Groups: time [2]
-## 
+##
 ##     name  time score    id
 ##   (fctr) (dbl) (dbl) (int)
 ## 1  Alice     1    90     1
@@ -619,7 +619,7 @@ 

## 6    Eve     2   100     6

summarize(group_by(normal, time), mean(score))
## Source: local data frame [2 x 2]
-## 
+##
 ##    time mean(score)
 ##   (dbl)       (dbl)
 ## 1     1   110.00000
@@ -627,7 +627,7 @@ 

mutate(group_by(normal, time), diff=score-mean(score))
## Source: local data frame [6 x 5]
 ## Groups: time [2]
-## 
+##
 ##     name  time score    id       diff
 ##   (fctr) (dbl) (dbl) (int)      (dbl)
 ## 1  Alice     1    90     1 -20.000000
@@ -638,7 +638,7 @@ 

## 6    Eve     2   100     6   1.666667

ungroup(mutate(group_by(normal, time), diff=score-mean(score)))
## Source: local data frame [6 x 5]
-## 
+##
 ##     name  time score    id       diff
 ##   (fctr) (dbl) (dbl) (int)      (dbl)
 ## 1  Alice     1    90     1 -20.000000
diff --git a/instructor/overflow.html b/instructor/overflow.html
index 7941bf6..b7c300b 100644
--- a/instructor/overflow.html
+++ b/instructor/overflow.html
@@ -54,7 +54,7 @@ 

Additional Course Materials

Dillon Niederhut

-

02 May, 2016

+

24 May, 2016

@@ -70,87 +70,87 @@

R has an interface to curl

call library(XML)

you can use this to access remote data

you may just want to read text lines from a webpage

-
RJ <- readLines("http://shakespeare.mit.edu/romeo_juliet/full.html")  
+
RJ <- readLines("http://shakespeare.mit.edu/romeo_juliet/full.html")
 RJ[1:25]
-
##  [1] "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\""              
-##  [2] " \"http://www.w3.org/TR/REC-html40/loose.dtd\">"                              
-##  [3] " <html>"                                                                      
-##  [4] " <head>"                                                                      
-##  [5] " <title>Romeo and Juliet: Entire Play"                                        
-##  [6] " </title>"                                                                    
+
##  [1] "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\""
+##  [2] " \"http://www.w3.org/TR/REC-html40/loose.dtd\">"
+##  [3] " <html>"
+##  [4] " <head>"
+##  [5] " <title>Romeo and Juliet: Entire Play"
+##  [6] " </title>"
 ##  [7] " <meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\">"
-##  [8] " <LINK rel=\"stylesheet\" type=\"text/css\" media=\"screen\""                 
-##  [9] "       href=\"/shake.css\">"                                                  
-## [10] " </HEAD>"                                                                     
-## [11] " <body bgcolor=\"#ffffff\" text=\"#000000\">"                                 
-## [12] ""                                                                             
-## [13] "<table width=\"100%\" bgcolor=\"#CCF6F6\">"                                   
-## [14] "<tr><td class=\"play\" align=\"center\">Romeo and Juliet"                     
-## [15] "<tr><td class=\"nav\" align=\"center\">"                                      
-## [16] "      <a href=\"/Shakespeare\">Shakespeare homepage</A> "                     
-## [17] "    | <A href=\"/romeo_juliet/\">Romeo and Juliet</A> "                       
-## [18] "    | Entire play"                                                            
-## [19] "</table>"                                                                     
-## [20] ""                                                                             
-## [21] "<H3>ACT I</h3>"                                                               
-## [22] "<h3>PROLOGUE</h3>"                                                            
-## [23] "<blockquote>"                                                                 
-## [24] "<A NAME=1.0.1>Two households, both alike in dignity,</A><br>"                 
+##  [8] " <LINK rel=\"stylesheet\" type=\"text/css\" media=\"screen\""
+##  [9] "       href=\"/shake.css\">"
+## [10] " </HEAD>"
+## [11] " <body bgcolor=\"#ffffff\" text=\"#000000\">"
+## [12] ""
+## [13] "<table width=\"100%\" bgcolor=\"#CCF6F6\">"
+## [14] "<tr><td class=\"play\" align=\"center\">Romeo and Juliet"
+## [15] "<tr><td class=\"nav\" align=\"center\">"
+## [16] "      <a href=\"/Shakespeare\">Shakespeare homepage</A> "
+## [17] "    | <A href=\"/romeo_juliet/\">Romeo and Juliet</A> "
+## [18] "    | Entire play"
+## [19] "</table>"
+## [20] ""
+## [21] "<H3>ACT I</h3>"
+## [22] "<h3>PROLOGUE</h3>"
+## [23] "<blockquote>"
+## [24] "<A NAME=1.0.1>Two households, both alike in dignity,</A><br>"
 ## [25] "<A NAME=1.0.2>In fair Verona, where we lay our scene,</A><br>"

and use the kinds of string manipulation we learned yesterday to retrieve the first lines of an act or a scene

RJ[grep("<h3>", RJ, perl=T)]
-
##  [1] "<h3>PROLOGUE</h3>"                                                        
-##  [2] "<h3>SCENE I. Verona. A public place.</h3>"                                
-##  [3] "<h3>SCENE II. A street.</h3>"                                             
-##  [4] "<h3>SCENE III. A room in Capulet's house.</h3>"                           
-##  [5] "<h3>SCENE IV. A street.</h3>"                                             
-##  [6] "<h3>SCENE V. A hall in Capulet's house.</h3>"                             
-##  [7] "<h3>PROLOGUE</h3>"                                                        
-##  [8] "<h3>SCENE I. A lane by the wall of Capulet's orchard.</h3>"               
-##  [9] "<h3>SCENE II. Capulet's orchard.</h3>"                                    
-## [10] "<h3>SCENE III. Friar Laurence's cell.</h3>"                               
-## [11] "<h3>SCENE IV. A street.</h3>"                                             
-## [12] "<h3>SCENE V. Capulet's orchard.</h3>"                                     
-## [13] "<h3>SCENE VI. Friar Laurence's cell.</h3>"                                
-## [14] "<h3>SCENE I. A public place.</h3>"                                        
-## [15] "<h3>SCENE II. Capulet's orchard.</h3>"                                    
-## [16] "<h3>SCENE III. Friar Laurence's cell.</h3>"                               
-## [17] "<h3>SCENE IV. A room in Capulet's house.</h3>"                            
-## [18] "<h3>SCENE V. Capulet's orchard.</h3>"                                     
-## [19] "<h3>SCENE I. Friar Laurence's cell.</h3>"                                 
-## [20] "<h3>SCENE II. Hall in Capulet's house.</h3>"                              
-## [21] "<h3>SCENE III. Juliet's chamber.</h3>"                                    
-## [22] "<h3>SCENE IV. Hall in Capulet's house.</h3>"                              
-## [23] "<h3>SCENE V. Juliet's chamber.</h3>"                                      
-## [24] "<h3>SCENE I. Mantua. A street.</h3>"                                      
-## [25] "<h3>SCENE II. Friar Laurence's cell.</h3>"                                
+
##  [1] "<h3>PROLOGUE</h3>"
+##  [2] "<h3>SCENE I. Verona. A public place.</h3>"
+##  [3] "<h3>SCENE II. A street.</h3>"
+##  [4] "<h3>SCENE III. A room in Capulet's house.</h3>"
+##  [5] "<h3>SCENE IV. A street.</h3>"
+##  [6] "<h3>SCENE V. A hall in Capulet's house.</h3>"
+##  [7] "<h3>PROLOGUE</h3>"
+##  [8] "<h3>SCENE I. A lane by the wall of Capulet's orchard.</h3>"
+##  [9] "<h3>SCENE II. Capulet's orchard.</h3>"
+## [10] "<h3>SCENE III. Friar Laurence's cell.</h3>"
+## [11] "<h3>SCENE IV. A street.</h3>"
+## [12] "<h3>SCENE V. Capulet's orchard.</h3>"
+## [13] "<h3>SCENE VI. Friar Laurence's cell.</h3>"
+## [14] "<h3>SCENE I. A public place.</h3>"
+## [15] "<h3>SCENE II. Capulet's orchard.</h3>"
+## [16] "<h3>SCENE III. Friar Laurence's cell.</h3>"
+## [17] "<h3>SCENE IV. A room in Capulet's house.</h3>"
+## [18] "<h3>SCENE V. Capulet's orchard.</h3>"
+## [19] "<h3>SCENE I. Friar Laurence's cell.</h3>"
+## [20] "<h3>SCENE II. Hall in Capulet's house.</h3>"
+## [21] "<h3>SCENE III. Juliet's chamber.</h3>"
+## [22] "<h3>SCENE IV. Hall in Capulet's house.</h3>"
+## [23] "<h3>SCENE V. Juliet's chamber.</h3>"
+## [24] "<h3>SCENE I. Mantua. A street.</h3>"
+## [25] "<h3>SCENE II. Friar Laurence's cell.</h3>"
 ## [26] "<h3>SCENE III. A churchyard; in it a tomb belonging to the Capulets.</h3>"
RJ[grep("<h3>", RJ, perl=TRUE)]
-
##  [1] "<h3>PROLOGUE</h3>"                                                        
-##  [2] "<h3>SCENE I. Verona. A public place.</h3>"                                
-##  [3] "<h3>SCENE II. A street.</h3>"                                             
-##  [4] "<h3>SCENE III. A room in Capulet's house.</h3>"                           
-##  [5] "<h3>SCENE IV. A street.</h3>"                                             
-##  [6] "<h3>SCENE V. A hall in Capulet's house.</h3>"                             
-##  [7] "<h3>PROLOGUE</h3>"                                                        
-##  [8] "<h3>SCENE I. A lane by the wall of Capulet's orchard.</h3>"               
-##  [9] "<h3>SCENE II. Capulet's orchard.</h3>"                                    
-## [10] "<h3>SCENE III. Friar Laurence's cell.</h3>"                               
-## [11] "<h3>SCENE IV. A street.</h3>"                                             
-## [12] "<h3>SCENE V. Capulet's orchard.</h3>"                                     
-## [13] "<h3>SCENE VI. Friar Laurence's cell.</h3>"                                
-## [14] "<h3>SCENE I. A public place.</h3>"                                        
-## [15] "<h3>SCENE II. Capulet's orchard.</h3>"                                    
-## [16] "<h3>SCENE III. Friar Laurence's cell.</h3>"                               
-## [17] "<h3>SCENE IV. A room in Capulet's house.</h3>"                            
-## [18] "<h3>SCENE V. Capulet's orchard.</h3>"                                     
-## [19] "<h3>SCENE I. Friar Laurence's cell.</h3>"                                 
-## [20] "<h3>SCENE II. Hall in Capulet's house.</h3>"                              
-## [21] "<h3>SCENE III. Juliet's chamber.</h3>"                                    
-## [22] "<h3>SCENE IV. Hall in Capulet's house.</h3>"                              
-## [23] "<h3>SCENE V. Juliet's chamber.</h3>"                                      
-## [24] "<h3>SCENE I. Mantua. A street.</h3>"                                      
-## [25] "<h3>SCENE II. Friar Laurence's cell.</h3>"                                
+
##  [1] "<h3>PROLOGUE</h3>"
+##  [2] "<h3>SCENE I. Verona. A public place.</h3>"
+##  [3] "<h3>SCENE II. A street.</h3>"
+##  [4] "<h3>SCENE III. A room in Capulet's house.</h3>"
+##  [5] "<h3>SCENE IV. A street.</h3>"
+##  [6] "<h3>SCENE V. A hall in Capulet's house.</h3>"
+##  [7] "<h3>PROLOGUE</h3>"
+##  [8] "<h3>SCENE I. A lane by the wall of Capulet's orchard.</h3>"
+##  [9] "<h3>SCENE II. Capulet's orchard.</h3>"
+## [10] "<h3>SCENE III. Friar Laurence's cell.</h3>"
+## [11] "<h3>SCENE IV. A street.</h3>"
+## [12] "<h3>SCENE V. Capulet's orchard.</h3>"
+## [13] "<h3>SCENE VI. Friar Laurence's cell.</h3>"
+## [14] "<h3>SCENE I. A public place.</h3>"
+## [15] "<h3>SCENE II. Capulet's orchard.</h3>"
+## [16] "<h3>SCENE III. Friar Laurence's cell.</h3>"
+## [17] "<h3>SCENE IV. A room in Capulet's house.</h3>"
+## [18] "<h3>SCENE V. Capulet's orchard.</h3>"
+## [19] "<h3>SCENE I. Friar Laurence's cell.</h3>"
+## [20] "<h3>SCENE II. Hall in Capulet's house.</h3>"
+## [21] "<h3>SCENE III. Juliet's chamber.</h3>"
+## [22] "<h3>SCENE IV. Hall in Capulet's house.</h3>"
+## [23] "<h3>SCENE V. Juliet's chamber.</h3>"
+## [24] "<h3>SCENE I. Mantua. A street.</h3>"
+## [25] "<h3>SCENE II. Friar Laurence's cell.</h3>"
 ## [26] "<h3>SCENE III. A churchyard; in it a tomb belonging to the Capulets.</h3>"

or maybe pull information out of an RSS feed

link <- "http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml"
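one way to pull the item titles out of a feed, sketched against a tiny in-memory stand-in (the toy feed below is invented; on the real feed the same two calls would run against xmlParse(link)):

```r
library(XML)
# a minimal RSS 2.0-shaped document standing in for the feed above
rss <- '<rss><channel>
          <item><title>First headline</title></item>
          <item><title>Second headline</title></item>
        </channel></rss>'
doc    <- xmlParse(rss, asText = TRUE)                # parse into a DOM tree
titles <- xpathSApply(doc, "//item/title", xmlValue)  # text of each item's <title>
titles
```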
@@ -210,20 +210,20 @@ 

Connecting to a database

install.packages("RPostgreSQL")
 library(RPostgreSQL)
 con <- dbConnect(dbDriver("PostgreSQL"),
-                 dbname="", 
+                 dbname="",
                  host="localhost",
-                 port=1234, 
-                 user="", 
+                 port=1234,
+                 user="",
                  password="")
 data <- dbReadTable(con, c("column1","column2"))
 dbDisconnect(con)

a popular non-relational database is MongoDB

install.packages("rmongodb")
 library(rmongodb)
-con <- mongo.create(host = localhost, 
-                      name = "", 
-                      username = "", 
-                      password = "", 
+con <- mongo.create(host = "localhost",
+                      name = "",
+                      username = "",
+                      password = "",
                       db = "admin")
 if(mongo.is.connected(con) == TRUE) {
   data <- mongo.find.all(con, "collection", list("city" = list( "$exists" = "true")))
@@ -260,7 +260,7 @@ 

group-wise operations/plyr

mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
# Consider the case where we want to calculate descriptive statistics across admits and not-admits
# from the dataset and return them as a data.frame
-ddata <- ddply(mydata, c("admit"), summarize, 
+ddata <- ddply(mydata, c("admit"), summarize,
                gpa.over3 = length(gpa[gpa>=3]),
                gpa.over3.5 = length(gpa[gpa>=3.5]),
                gpa.over3per = length(gpa[gpa>=3])/length(gpa),

@@ -277,7 +277,7 @@

Group-wise Operations/plyr/functions

add a column containing the average gre score of students

-
mydata <- ddply(mydata, c("admit"), transform, 
+
mydata <- ddply(mydata, c("admit"), transform,
                 gre.ave=mean(x=gre, na.rm=T),
                 gre.sd = sd(x=gre, na.rm=T))
 head(mydata)