diff --git a/PUBLICITY.md b/PUBLICITY.md new file mode 100644 index 0000000..8a15c07 --- /dev/null +++ b/PUBLICITY.md @@ -0,0 +1,55 @@ +# Facetweet announcement + +Learn how to analyze your datasets in R! [insert link here](https://youtu.be/dQw4w9WgXcQ) + +# Information for calendar + +The workshop duration is 3hrs per class. + +# Descriptions for website + +## Header + +**title** : R for Data Science + +**description** : The R for Data Science workshop series is a four part course, designed to take novices in the R language for statistical computing and produce programmers who are competent in finding, displaying, analyzing, and publishing data in R. + +## Part 1 + +**subtitle** : Basics of R + +**description** : Students will understand the motivation behind object orientation, and how that relates to computation. Students will be able to perform basic functions in R necessary to use the software on their computers and conduct basic arithmetic. Students will understand data types and data structures, and why and how they are different from each other. + +**knowledge requirements** : [Programming Fun!damentals](https://github.com/dlab-berkeley/programming-fundamentals), or equivalent prior knowledge + +**tech requirements** : Laptop required; please install R version 3.2 or greater in advance (University laptops will need to have R installed by an administrator); the RStudio IDE is recommended but not required + +## Part 2 + +**subtitle** : Clean and tidy data + +**description** : Students will be introduced to DRY principles and best practices for sanitizing and tidying data. Students will learn what missingness is, and how best to accommodate missing data in their research designs. Students will be able to read in files from disk or a database, clean the data found within them, select specific data from them, and merge them with other datasets. 
+ +**knowledge requirements** : R-for-Data-Science Part 1 or equivalent prior knowledge + +**tech requirements** : Laptop required; please install R version 3.2 or greater in advance (University laptops will need to have R installed by an administrator); the RStudio IDE is recommended but not required + +## Part 3 + +**subtitle** : Analyzing data + +**description** : Students will be introduced to the principles behind the grammar of graphics and the general linear model. Students will understand the implementation of plotting in R. Students will be able to explore, summarize, and analyze data using R's implementation of exploratory and inferential data analysis. + +**knowledge requirements** : R-for-Data-Science Part 2 or equivalent prior knowledge + +**tech requirements** : Laptop required; please install R version 3.2 or greater in advance (University laptops will need to have R installed by an administrator); the RStudio IDE is recommended but not required + +## Part 4 + +**subtitle** : Functions and packages + +**description** : Students will be introduced to the principles behind functional programming. Students will learn how to write and import functions, add looped and vectorized computation to their functions, and control the flow of data through a function. Students will understand the basics of name spaces, and how that relates to assigning values within functions. Students will see how to successfully package a function for CRAN. 
+ +**knowledge requirements** : R-for-Data-Science Part 2 or equivalent prior knowledge + +**tech requirements** : Laptop required; please install R version 3.2 or greater in advance (University laptops will need to have R installed by an administrator); the RStudio IDE is recommended but not required diff --git a/README.md b/README.md index 3c4651e..b46ecc0 100644 --- a/README.md +++ b/README.md @@ -19,7 +19,7 @@ The instructor of this workshop series will lead you through the activities for ## If you are a D-Lab instructor -You'll see accumulated teaching notes and examples for each day's topics in the instructor folder. For your convenience, these are available as .Rmd, commented .R files, PDF documents, and HTML slides. +You'll see accumulated teaching notes and examples for each day's topics in the instructor folder. For your convenience, these are available as .Rmd, commented .R files, PDF documents, and HTML slides. The meta-document for this workshop series, which explains the logic behind the structure and topics, can be viewed [at the D-Lab guides repository](https://github.com/dlab-berkeley/guides/blob/master/r.pdf) For information on contributing to this repository, see `CONTRIBUTING.md` @@ -61,17 +61,17 @@ This workshop series covers: This workshop uses the following packages: -1. Amelia -2. devtools -3. dplyr -4. foreign -5. ggplot2 -6. parallelMap -7. RCurl -8. reshape2 -9. roxygen2 -10. stringr -11. XML +* Amelia +* devtools +* dplyr +* foreign +* ggplot2 +* parallelMap +* RCurl +* roxygen2 +* stringr +* tidyr +* XML --- _D-Lab == Data Intensive Social Science, For All!_ diff --git a/data/dirty.csv b/data/dirty.csv index 92aa05e..b7d2c22 100644 --- a/data/dirty.csv +++ b/data/dirty.csv @@ -1,6 +1,6 @@ Timestamp,How tall are you?,What department are you in?,Are you currently enrolled?,What is your birth order? 
7/25/2015 10:08:41,very,Geology ,Yes,1 7/25/2015 10:10:56,70,999,Yes,1 -7/25/2015 10:11:20,5’9, geology,999,2 +7/25/2015 10:11:20,5'9, geology,999,2 7/25/2015 10:11:25,2.1,goelogy,No,"9,000" -7/25/2015 10:11:29,156,anthro,999,2 \ No newline at end of file +7/25/2015 10:11:29,156,anthro,999,2 diff --git a/instructor/day_four.html b/instructor/day_four.html index 4576a9c..d3c37e2 100644 --- a/instructor/day_four.html +++ b/instructor/day_four.html @@ -55,7 +55,7 @@
02 May, 2016
+24 May, 2016
You can also loop over a non-numeric vector
@@ -93,11 +93,11 @@Next we move on to control structures, such as if statements. "If" statements are very useful when you want to assign different tasks to different subsets of data within a single for-loop. The basic syntax looks like the following: if(condition){statement} else{other statement}
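As a minimal sketch of the point above, an if/else inside a single for-loop can route each element to a different task. The `heights` vector and the 70-inch cutoff are made-up values for illustration, not taken from the workshop data.

```r
# hypothetical data: heights in inches (illustrative values only)
heights <- c(69, 54, 73, 82)
labels <- character(length(heights))

for (i in seq_along(heights)) {
  if (heights[i] > 70) {
    labels[i] <- "tall"
  } else {   # "else" follows the closing brace on the same line
    labels[i] <- "not tall"
  }
}

# the vectorized equivalent, for simple two-way choices
labels_vec <- ifelse(heights > 70, "tall", "not tall")
```

The loop form is worth keeping when each branch needs several statements; for a one-line choice like this, `ifelse()` is usually shorter.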
@@ -123,7 +123,7 @@Looping
diff --git a/instructor/day_one.html b/instructor/day_one.html index f2fd818..c80ac3d 100644 --- a/instructor/day_one.html +++ b/instructor/day_one.html @@ -54,7 +54,7 @@x <- 7 if(x > 10){ print(x) - + }else{ # "else" should not start its own line. # Always let it be preceded by a closing brace on the same line. print("NOT BIG ENOUGH!!") @@ -137,15 +137,15 @@
Looping
## [8] "male" "male" "male" "female" "female" "male" "female" ## [15] "male" "female" "female" "male" "male" "female" "female" ## [22] "male" "male" "male" "male" "female" "male" "female" -## [29] "female" "male" "male" "female" "female" "female" "male" -## [36] "female" "female" "female" "female" "female" "male" "male" -## [43] "male" "female" "male" "female" "female" "male" "male" +## [29] "female" "male" "male" "female" "female" "female" "male" +## [36] "female" "female" "female" "female" "female" "male" "male" +## [43] "male" "female" "male" "female" "female" "male" "male" ## [50] "male" "male" "male" "male" "female" "female" "female" -## [57] "male" "female" "male" "female" "male" "female" "male" -## [64] "female" "female" "female" "female" "male" "male" "male" +## [57] "male" "female" "male" "female" "male" "female" "male" +## [64] "female" "female" "female" "female" "male" "male" "male" ## [71] "female" "male" "female" "male" "female" "female" "female" -## [78] "female" "male" "male" "male" "male" "female" "male" -## [85] "female" "male" "female" "male" "female" "female" "male" +## [78] "female" "male" "male" "male" "male" "female" "male" +## [85] "female" "male" "female" "male" "female" "female" "male" ## [92] "male" "male" "male" "female" "female" "female" "female" ## [99] "female" "male"gender <- ifelse(gender=="male", 1, 0) @@ -175,7 +175,7 @@
every function has three parts
body(f)
## x + 1
-environment(f)
+## <environment: 0x7f81d9374c60>
## <environment: 0x7fe163bba308>
environments are where the function was defined
see how our function has
R_GlobalEnv
as its environment? that’s because we defined it in the global environment
this means that if you tell a function to look for an
@@ -260,13 +260,13 @@object
, it will look in the global namespace
the right way to be functional
lapply(heights, in_to_m)
## [[1]] ## [1] 1.7526 -## +## ## [[2]] ## [1] 1.3716 -## +## ## [[3]] ## [1] 1.8542 -## +## ## [[4]] ## [1] 2.0828
it’s not always smart to name functions
@@ -274,13 +274,13 @@it’s not always smart to name
lapply(heights, FUN = function(x) x %/% 12)
## [[1]] ## [1] 5 -## +## ## [[2]] ## [1] 4 -## +## ## [[3]] ## [1] 6 -## +## ## [[4]] ## [1] 6
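The two calls above can be reproduced end to end with the sketch below. The `heights` values and the body of `in_to_m` are not shown in this part of the materials; both are assumptions inferred from the printed output (inches multiplied by 0.0254).

```r
# assumed input: heights in inches, inferred from the output above
heights <- c(69, 54, 73, 82)

# assumed definition: convert inches to meters
in_to_m <- function(x) x * 0.0254

meters <- lapply(heights, in_to_m)                # named function
feet   <- lapply(heights, function(x) x %/% 12)   # anonymous function

# vapply() returns a plain vector and checks each result's type and length
feet_vec <- vapply(heights, function(x) x %/% 12, numeric(1))
```

An anonymous function keeps a one-off calculation next to the place it is used; a named function is the better choice once the same conversion is needed in more than one place.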
lapply has limits
@@ -293,10 +293,10 @@lapply has limits
lapply(dat, mean)
## $a ## [1] NA -## +## ## $b ## [1] NA -## +## ## $c ## [1] NA
we know there are numbers there - why are the means all missing?
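One likely answer, sketched below: `mean()` returns `NA` whenever its input contains a missing value unless `na.rm = TRUE` is set, and `lapply()` passes extra arguments through to the function it applies. The small `dat` here is a hypothetical stand-in, not the workshop's dataset.

```r
# hypothetical data frame with one missing value per column
dat <- data.frame(a = c(1, NA, 3),
                  b = c(NA, 5, 6),
                  c = c(7, 8, NA))

# a single NA in a column makes the whole mean NA
with_na <- lapply(dat, mean)

# extra arguments after the function name are forwarded to mean()
without_na <- lapply(dat, mean, na.rm = TRUE)
```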
@@ -315,16 +315,16 @@this can be parallelized
side note - previous versions of these materials imported the
parallel
library, which is no longer supported as of R versions >= 3.2-install.packages('parallelMap')
## +
+ /var/folders/rj/8gpcssqd52z9yrqw7f8xxfym0000gn/T//RtmpmP1txl/downloaded_packages## ## The downloaded binary packages are in -## /var/folders/rj/8gpcssqd52z9yrqw7f8xxfym0000gn/T//Rtmp2xjYZ7/downloaded_packages
-library(parallelMap) system.time(Map(median, dat, na.rm=TRUE))
## user system elapsed +
## user system elapsed ## 0.000 0.000 0.001
-system.time(parallelMap(median, dat, na.rm=TRUE))
+## user system elapsed -## 0.001 0.000 0.001
## user system elapsed +## 0 0 0
parallel processing incurs time costs from memory management and message passing that can make small jobs take longer in parallel than in serial
How should we count the number of times the words appear in each act? Create a wrapper function that counts the occurrences of each word and returns the counts.
countR <- function(z){ - return(c(length(grep("Romeo", z, perl=T)), length(grep("Juliet", z, perl=T)))) + return(c(length(grep("Romeo", z, perl=T)), length(grep("Juliet", z, perl=T)))) } lapply(x, countR)
## [[1]] ## [1] 8 4 -## +## ## [[2]] ## [1] 30 3 -## +## ## [[3]] ## [1] 54 13 -## +## ## [[4]] ## [1] 9 8 -## +## ## [[5]] ## [1] 20 19
Now count the lines in each scene
# now count the lines in each scene countL <- function(z){ - return(length(grep("</A><br>$", z, perl=T))) + return(length(grep("</A><br>$", z, perl=T))) } lapply(x, countL)
## [[1]] ## [1] 739 -## +## ## [[2]] ## [1] 685 -## +## ## [[3]] ## [1] 821 -## +## ## [[4]] ## [1] 407 -## +## ## [[5]] ## [1] 441
Day One: R Basics
-02 May, 2016
+24 May, 2016
diff --git a/instructor/day_two.html b/instructor/day_two.html index adca61f..5a8d1bb 100644 --- a/instructor/day_two.html +++ b/instructor/day_two.html @@ -55,7 +55,7 @@Day Two: Data Cleaning
-02 May, 2016
+24 May, 2016
@@ -200,16 +200,16 @@## 'data.frame': 5 obs. of 5 variables: ## $ Timestamp : Factor w/ 5 levels "7/25/2015 10:08:41",..: 1 2 3 4 5 -## $ How.tall.are.you. : Factor w/ 5 levels "156","2.1","5’9",..: 5 4 3 2 1 +## $ How.tall.are.you. : Factor w/ 5 levels "156","2.1","5'9",..: 5 4 3 2 1 ## $ What.department.are.you.in.: Factor w/ 5 levels " geology","999",..: 4 2 1 5 3 ## $ Are.you.currently.enrolled.: Factor w/ 3 levels "999","No","Yes": 3 3 1 2 1 ## $ What.is.your.birth.order. : Factor w/ 3 levels "1","2","9,000": 1 1 2 3 2
it’s usua str(dirty)
## 'data.frame': 5 obs. of 5 variables: ## $ Timestamp : chr "7/25/2015 10:08:41" "7/25/2015 10:10:56" "7/25/2015 10:11:20" "7/25/2015 10:11:25" ... -## $ How.tall.are.you. : chr "very" "70" "5’9" "2.1" ... +## $ How.tall.are.you. : chr "very" "70" "5'9" "2.1" ... ## $ What.department.are.you.in.: chr "Geology " "999" " geology" "goelogy" ... ## $ Are.you.currently.enrolled.: chr "Yes" "Yes" "999" "No" ... ## $ What.is.your.birth.order. : chr "1" "1" "2" "9,000" ...
let’s start by removing the empty rows and columns
tail(dirty)
## Timestamp How.tall.are.you. What.department.are.you.in. -## 1 7/25/2015 10:08:41 very Geology +## 1 7/25/2015 10:08:41 very Geology ## 2 7/25/2015 10:10:56 70 999 -## 3 7/25/2015 10:11:20 5’9 geology +## 3 7/25/2015 10:11:20 5'9 geology ## 4 7/25/2015 10:11:25 2.1 goelogy ## 5 7/25/2015 10:11:29 156 anthro ## Are.you.currently.enrolled. What.is.your.birth.order. @@ -224,7 +224,7 @@
let’s start by remo
you can replace variable names
and you should, if they are uninformative or long
-names(dirty)
## [1] "Timestamp" "How.tall.are.you." +
## [1] "Timestamp" "How.tall.are.you." ## [3] "What.department.are.you.in." "Are.you.currently.enrolled." ## [5] "What.is.your.birth.order."
@@ -234,13 +234,13 @@names(dirty) <- c("time", "height", "dept", "enroll", "birth.order")
you should replace all of these values in your dataframe with R’s missingness signifier,
NA
-table(dirty$enroll)
## -## 999 No Yes +
## +## 999 No Yes ## 2 1 2
-dirty$enroll[dirty$enroll=="999"] <- NA table(dirty$enroll, useNA = "ifany")
## -## No Yes <NA> +
## +## No Yes <NA> ## 1 2 2
side note - read.table() has an option to specify field values as
@@ -310,27 +310,27 @@NA
as soon as you import the data, but this is a BAAAAD idea because R automatically encodes blank fields in numeric and logical columns as missing too, and thus you lose the ability to distinguish between user-missing and experimenter-missing
remember how we tal
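To make the side note about recoding at import concrete, here is a minimal sketch. The inline `raw` text is invented for illustration: it mixes the survey's `999` code with a field that was simply left blank.

```r
# hypothetical extract: one field holds the 999 code, another is blank
raw <- "enroll,birth.order\nYes,1\n999,\nNo,999\n"

# recoding at import: in the numeric column, the blank and the 999 both become NA
at_import <- read.csv(text = raw, na.strings = "999")
at_import$birth.order

# reading everything as text keeps the two cases distinct
as_text <- read.csv(text = raw, colClasses = "character")
as_text$birth.order

# recode only the 999 code, after inspecting the data
as_text$birth.order[as_text$birth.order == "999"] <- NA
```

After the last step the blank field is still an empty string and only the coded value is `NA`, so the two kinds of missingness can still be told apart.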
let’s use this large dataset as an example
-large <- read.csv('data/large.csv') summary(large)
## a b c -## Min. :-33.98426 Min. :-13.4 Min. :-249998.64 -## 1st Qu.: -6.71903 1st Qu.:128.6 1st Qu.:-141005.65 -## Median : 0.41681 Median :256.9 Median : -63498.56 -## Mean : 0.00176 Mean :252.2 Mean : -83954.09 -## 3rd Qu.: 7.00630 3rd Qu.:377.5 3rd Qu.: -15748.98 -## Max. : 35.33306 Max. :513.3 Max. : 11.77 +
## a b c +## Min. :-33.98426 Min. :-13.4 Min. :-249998.64 +## 1st Qu.: -6.71903 1st Qu.:128.6 1st Qu.:-141005.65 +## Median : 0.41681 Median :256.9 Median : -63498.56 +## Mean : 0.00176 Mean :252.2 Mean : -83954.09 +## 3rd Qu.: 7.00630 3rd Qu.:377.5 3rd Qu.: -15748.98 +## Max. : 35.33306 Max. :513.3 Max. : 11.77 ## NA's :45 NA's :45 NA's :45
nrow(na.omit(large))
## [1] 871
for it to work you need low missingness and large N
a <- amelia(large,m = 1)
## -- Imputation 1 -- -## +## ## 1 2 3
-print(a)
## +
@@ -338,12 +338,12 @@## ## Amelia output with 1 imputed datasets. -## Return code: 1 -## Message: Normal EM convergence. -## +## Return code: 1 +## Message: Normal EM convergence. +## ## Chain Lengths: ## -------------- ## Imputation 1: 3
-large.imputed <- a[[1]][[1]] summary(large.imputed)
## a b c -## Min. :-33.98426 Min. :-13.4 Min. :-249999 -## 1st Qu.: -6.73649 1st Qu.:126.5 1st Qu.:-140641 -## Median : 0.30970 Median :252.0 Median : -63513 -## Mean : -0.01213 Mean :250.0 Mean : -83156 -## 3rd Qu.: 6.99412 3rd Qu.:373.9 3rd Qu.: -15561 +
## a b c +## Min. :-33.98426 Min. :-13.4 Min. :-249999 +## 1st Qu.: -6.73649 1st Qu.:126.5 1st Qu.:-140641 +## Median : 0.30970 Median :252.0 Median : -63513 +## Mean : -0.01213 Mean :250.0 Mean : -83156 +## 3rd Qu.: 6.99412 3rd Qu.:373.9 3rd Qu.: -15561 ## Max. : 35.33306 Max. :518.7 Max. : 69498
if you give it a tiny dataset, it will fuss at you
@@ -352,14 +352,14 @@a <- amelia(large[990:1000,],m = 1)
if you give it a tiny ## variables in the imputation model. Consider removing some variables, or ## reducing the order of time polynomials to reduce the number of parameters.
## -- Imputation 1 -- -## +## ## 1 2
-print(a)
## +
@@ -404,10 +404,10 @@## ## Amelia output with 1 imputed datasets. -## Return code: 1 -## Message: Normal EM convergence. -## +## Return code: 1 +## Message: Normal EM convergence. +## ## Chain Lengths: ## -------------- ## Imputation 1: 2
subsetting data frames
my.data$numeric == 2
## logical(0)
-my.data[my.data$numeric == 2,]
## [1] n -## [2] c -## [3] b -## [4] d +
## [1] n +## [2] c +## [3] b +## [4] d ## [5] really.long.and.complicated.variable.name ## <0 rows> (or 0-length row.names)
boolean variables can act as filters right out of the box
@@ -423,19 +423,19 @@you can also select columns
you can also match elements from a vector
-good.things <- c("three", "four", "five") my.data[my.data$character %in% good.things, ]
## [1] n -## [2] c -## [3] b -## [4] d +
## [1] n +## [2] c +## [3] b +## [4] d ## [5] really.long.and.complicated.variable.name ## <0 rows> (or 0-length row.names)
most subsetting operations on dataframes also return a dataframe
str(my.data[!(my.data$character %in% good.things), ])
## 'data.frame': 0 obs. of 5 variables: -## $ n : num -## $ c : Factor w/ 3 levels "one","three",..: -## $ b : logi -## $ d :Class 'Date' num(0) +## $ n : num +## $ c : Factor w/ 3 levels "one","three",..: +## $ b : logi +## $ d :Class 'Date' num(0) ## $ really.long.and.complicated.variable.name: num
subsets that are a single column return a vector
@@ -484,16 +484,16 @@str(my.data$numeric)
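The subsetting rules above can be checked on a small example. The data frame `my.df` and its columns are hypothetical and are not the `my.data` object used in the slides.

```r
# hypothetical data frame for illustration
my.df <- data.frame(n  = c(1, 2, 3),
                    ch = c("one", "two", "three"),
                    stringsAsFactors = FALSE)
good.things <- c("three", "four", "five")

# filtering rows with a logical vector returns a data frame
rows <- my.df[my.df$ch %in% good.things, ]

# selecting a single column returns a vector ...
col_vec <- my.df$n

# ... unless drop = FALSE asks for a one-column data frame
col_df <- my.df[, "n", drop = FALSE]

# a column name that does not exist returns NULL, and comparing NULL
# to a value yields logical(0), which selects zero rows
missing_col <- my.df$numeric
empty <- my.df[missing_col == 2, ]
```

The last two lines may explain zero-row results like the ones printed above: a misspelled column name does not raise an error, it quietly produces an empty filter.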
reshaping
side note - don’t worry about how this works yet - we’ll talk about it tomorrow
-t.test(score ~ time, data=normal)
## +
## ## Welch Two Sample t-test -## +## ## data: score by time ## t = 0.58132, df = 2.0278, p-value = 0.6191 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## -73.56101 96.89434 ## sample estimates: -## mean in group 1 mean in group 2 +## mean in group 1 mean in group 2 ## 110.00000 98.33333
it’s easy to combine tidy tables to compare different levels of information simultaneously
@@ -608,7 +608,7 @@dplyr allows you to apply
group_by(normal, time)
## Source: local data frame [6 x 4] ## Groups: time [2] -## +## ## name time score id ## (fctr) (dbl) (dbl) (int) ## 1 Alice 1 90 1 @@ -619,7 +619,7 @@
dplyr allows you to apply ## 6 Eve 2 100 6
summarize(group_by(normal, time), mean(score))
## Source: local data frame [2 x 2] -## +## ## time mean(score) ## (dbl) (dbl) ## 1 1 110.00000 @@ -627,7 +627,7 @@
dplyr allows you to apply
mutate(group_by(normal, time), diff=score-mean(score))
## Source: local data frame [6 x 5] ## Groups: time [2] -## +## ## name time score id diff ## (fctr) (dbl) (dbl) (int) (dbl) ## 1 Alice 1 90 1 -20.000000 @@ -638,7 +638,7 @@
dplyr allows you to apply ## 6 Eve 2 100 6 1.666667
ungroup(mutate(group_by(normal, time), diff=score-mean(score)))
## Source: local data frame [6 x 5] -## +## ## name time score id diff ## (fctr) (dbl) (dbl) (int) (dbl) ## 1 Alice 1 90 1 -20.000000 diff --git a/instructor/overflow.html b/instructor/overflow.html index 7941bf6..b7c300b 100644 --- a/instructor/overflow.html +++ b/instructor/overflow.html @@ -54,7 +54,7 @@
Additional Course Materials
-02 May, 2016
+24 May, 2016
you can use this to access remote data
you may just want to read text lines from a webpage
-RJ <- readLines("http://shakespeare.mit.edu/romeo_juliet/full.html") +
-RJ <- readLines("http://shakespeare.mit.edu/romeo_juliet/full.html") RJ[1:25]
## [1] "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\"" -## [2] " \"http://www.w3.org/TR/REC-html40/loose.dtd\">" -## [3] " <html>" -## [4] " <head>" -## [5] " <title>Romeo and Juliet: Entire Play" -## [6] " </title>" +
## [1] "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\"" +## [2] " \"http://www.w3.org/TR/REC-html40/loose.dtd\">" +## [3] " <html>" +## [4] " <head>" +## [5] " <title>Romeo and Juliet: Entire Play" +## [6] " </title>" ## [7] " <meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\">" -## [8] " <LINK rel=\"stylesheet\" type=\"text/css\" media=\"screen\"" -## [9] " href=\"/shake.css\">" -## [10] " </HEAD>" -## [11] " <body bgcolor=\"#ffffff\" text=\"#000000\">" -## [12] "" -## [13] "<table width=\"100%\" bgcolor=\"#CCF6F6\">" -## [14] "<tr><td class=\"play\" align=\"center\">Romeo and Juliet" -## [15] "<tr><td class=\"nav\" align=\"center\">" -## [16] " <a href=\"/Shakespeare\">Shakespeare homepage</A> " -## [17] " | <A href=\"/romeo_juliet/\">Romeo and Juliet</A> " -## [18] " | Entire play" -## [19] "</table>" -## [20] "" -## [21] "<H3>ACT I</h3>" -## [22] "<h3>PROLOGUE</h3>" -## [23] "<blockquote>" -## [24] "<A NAME=1.0.1>Two households, both alike in dignity,</A><br>" +## [8] " <LINK rel=\"stylesheet\" type=\"text/css\" media=\"screen\"" +## [9] " href=\"/shake.css\">" +## [10] " </HEAD>" +## [11] " <body bgcolor=\"#ffffff\" text=\"#000000\">" +## [12] "" +## [13] "<table width=\"100%\" bgcolor=\"#CCF6F6\">" +## [14] "<tr><td class=\"play\" align=\"center\">Romeo and Juliet" +## [15] "<tr><td class=\"nav\" align=\"center\">" +## [16] " <a href=\"/Shakespeare\">Shakespeare homepage</A> " +## [17] " | <A href=\"/romeo_juliet/\">Romeo and Juliet</A> " +## [18] " | Entire play" +## [19] "</table>" +## [20] "" +## [21] "<H3>ACT I</h3>" +## [22] "<h3>PROLOGUE</h3>" +## [23] "<blockquote>" +## [24] "<A NAME=1.0.1>Two households, both alike in dignity,</A><br>" ## [25] "<A NAME=1.0.2>In fair Verona, where we lay our scene,</A><br>"
and use the kinds of string manipulation we learned yesterday to retrieve the first lines of an act or a scene
-RJ[grep("<h3>", RJ, perl=T)]
## [1] "<h3>PROLOGUE</h3>" -## [2] "<h3>SCENE I. Verona. A public place.</h3>" -## [3] "<h3>SCENE II. A street.</h3>" -## [4] "<h3>SCENE III. A room in Capulet's house.</h3>" -## [5] "<h3>SCENE IV. A street.</h3>" -## [6] "<h3>SCENE V. A hall in Capulet's house.</h3>" -## [7] "<h3>PROLOGUE</h3>" -## [8] "<h3>SCENE I. A lane by the wall of Capulet's orchard.</h3>" -## [9] "<h3>SCENE II. Capulet's orchard.</h3>" -## [10] "<h3>SCENE III. Friar Laurence's cell.</h3>" -## [11] "<h3>SCENE IV. A street.</h3>" -## [12] "<h3>SCENE V. Capulet's orchard.</h3>" -## [13] "<h3>SCENE VI. Friar Laurence's cell.</h3>" -## [14] "<h3>SCENE I. A public place.</h3>" -## [15] "<h3>SCENE II. Capulet's orchard.</h3>" -## [16] "<h3>SCENE III. Friar Laurence's cell.</h3>" -## [17] "<h3>SCENE IV. A room in Capulet's house.</h3>" -## [18] "<h3>SCENE V. Capulet's orchard.</h3>" -## [19] "<h3>SCENE I. Friar Laurence's cell.</h3>" -## [20] "<h3>SCENE II. Hall in Capulet's house.</h3>" -## [21] "<h3>SCENE III. Juliet's chamber.</h3>" -## [22] "<h3>SCENE IV. Hall in Capulet's house.</h3>" -## [23] "<h3>SCENE V. Juliet's chamber.</h3>" -## [24] "<h3>SCENE I. Mantua. A street.</h3>" -## [25] "<h3>SCENE II. Friar Laurence's cell.</h3>" +
## [1] "<h3>PROLOGUE</h3>" +## [2] "<h3>SCENE I. Verona. A public place.</h3>" +## [3] "<h3>SCENE II. A street.</h3>" +## [4] "<h3>SCENE III. A room in Capulet's house.</h3>" +## [5] "<h3>SCENE IV. A street.</h3>" +## [6] "<h3>SCENE V. A hall in Capulet's house.</h3>" +## [7] "<h3>PROLOGUE</h3>" +## [8] "<h3>SCENE I. A lane by the wall of Capulet's orchard.</h3>" +## [9] "<h3>SCENE II. Capulet's orchard.</h3>" +## [10] "<h3>SCENE III. Friar Laurence's cell.</h3>" +## [11] "<h3>SCENE IV. A street.</h3>" +## [12] "<h3>SCENE V. Capulet's orchard.</h3>" +## [13] "<h3>SCENE VI. Friar Laurence's cell.</h3>" +## [14] "<h3>SCENE I. A public place.</h3>" +## [15] "<h3>SCENE II. Capulet's orchard.</h3>" +## [16] "<h3>SCENE III. Friar Laurence's cell.</h3>" +## [17] "<h3>SCENE IV. A room in Capulet's house.</h3>" +## [18] "<h3>SCENE V. Capulet's orchard.</h3>" +## [19] "<h3>SCENE I. Friar Laurence's cell.</h3>" +## [20] "<h3>SCENE II. Hall in Capulet's house.</h3>" +## [21] "<h3>SCENE III. Juliet's chamber.</h3>" +## [22] "<h3>SCENE IV. Hall in Capulet's house.</h3>" +## [23] "<h3>SCENE V. Juliet's chamber.</h3>" +## [24] "<h3>SCENE I. Mantua. A street.</h3>" +## [25] "<h3>SCENE II. Friar Laurence's cell.</h3>" ## [26] "<h3>SCENE III. A churchyard; in it a tomb belonging to the Capulets.</h3>"
-RJ[grep("<h3>", RJ, perl=TRUE)]
## [1] "<h3>PROLOGUE</h3>" -## [2] "<h3>SCENE I. Verona. A public place.</h3>" -## [3] "<h3>SCENE II. A street.</h3>" -## [4] "<h3>SCENE III. A room in Capulet's house.</h3>" -## [5] "<h3>SCENE IV. A street.</h3>" -## [6] "<h3>SCENE V. A hall in Capulet's house.</h3>" -## [7] "<h3>PROLOGUE</h3>" -## [8] "<h3>SCENE I. A lane by the wall of Capulet's orchard.</h3>" -## [9] "<h3>SCENE II. Capulet's orchard.</h3>" -## [10] "<h3>SCENE III. Friar Laurence's cell.</h3>" -## [11] "<h3>SCENE IV. A street.</h3>" -## [12] "<h3>SCENE V. Capulet's orchard.</h3>" -## [13] "<h3>SCENE VI. Friar Laurence's cell.</h3>" -## [14] "<h3>SCENE I. A public place.</h3>" -## [15] "<h3>SCENE II. Capulet's orchard.</h3>" -## [16] "<h3>SCENE III. Friar Laurence's cell.</h3>" -## [17] "<h3>SCENE IV. A room in Capulet's house.</h3>" -## [18] "<h3>SCENE V. Capulet's orchard.</h3>" -## [19] "<h3>SCENE I. Friar Laurence's cell.</h3>" -## [20] "<h3>SCENE II. Hall in Capulet's house.</h3>" -## [21] "<h3>SCENE III. Juliet's chamber.</h3>" -## [22] "<h3>SCENE IV. Hall in Capulet's house.</h3>" -## [23] "<h3>SCENE V. Juliet's chamber.</h3>" -## [24] "<h3>SCENE I. Mantua. A street.</h3>" -## [25] "<h3>SCENE II. Friar Laurence's cell.</h3>" +
## [1] "<h3>PROLOGUE</h3>" +## [2] "<h3>SCENE I. Verona. A public place.</h3>" +## [3] "<h3>SCENE II. A street.</h3>" +## [4] "<h3>SCENE III. A room in Capulet's house.</h3>" +## [5] "<h3>SCENE IV. A street.</h3>" +## [6] "<h3>SCENE V. A hall in Capulet's house.</h3>" +## [7] "<h3>PROLOGUE</h3>" +## [8] "<h3>SCENE I. A lane by the wall of Capulet's orchard.</h3>" +## [9] "<h3>SCENE II. Capulet's orchard.</h3>" +## [10] "<h3>SCENE III. Friar Laurence's cell.</h3>" +## [11] "<h3>SCENE IV. A street.</h3>" +## [12] "<h3>SCENE V. Capulet's orchard.</h3>" +## [13] "<h3>SCENE VI. Friar Laurence's cell.</h3>" +## [14] "<h3>SCENE I. A public place.</h3>" +## [15] "<h3>SCENE II. Capulet's orchard.</h3>" +## [16] "<h3>SCENE III. Friar Laurence's cell.</h3>" +## [17] "<h3>SCENE IV. A room in Capulet's house.</h3>" +## [18] "<h3>SCENE V. Capulet's orchard.</h3>" +## [19] "<h3>SCENE I. Friar Laurence's cell.</h3>" +## [20] "<h3>SCENE II. Hall in Capulet's house.</h3>" +## [21] "<h3>SCENE III. Juliet's chamber.</h3>" +## [22] "<h3>SCENE IV. Hall in Capulet's house.</h3>" +## [23] "<h3>SCENE V. Juliet's chamber.</h3>" +## [24] "<h3>SCENE I. Mantua. A street.</h3>" +## [25] "<h3>SCENE II. Friar Laurence's cell.</h3>" ## [26] "<h3>SCENE III. A churchyard; in it a tomb belonging to the Capulets.</h3>"
or maybe pull information out of an RSS feed
link <- "http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml" @@ -210,20 +210,20 @@
Connecting to a database
install.packages("RPostgreSQL") library(RPostgreSQL) con <- dbConnect(dbDriver("PostgreSQL"), - dbname="", + dbname="", host="localhost", - port=1234, - user="", + port=1234, + user="", password="") data <- dbReadTable(con, c("column1","column2")) dbDisconnect(con)
a popular non-relational database is MongoDB
install.packages("rmongodb") library(rmongodb) -con <- mongo.create(host = localhost, - name = "", - username = "", - password = "", +con <- mongo.create(host = localhost, + name = "", + username = "", + password = "", db = "admin") if(mongo.is.connected(con) == TRUE) { data <- mongo.find.all(con, "collection", list("city" = list( "$exists" = "true"))) @@ -260,7 +260,7 @@
group-wise operations/plyr mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv") # Consider the case where we want to calculate descriptive statistics across admits and not-admits # from the dataset and return them as a data.frame -ddata <- ddply(mydata, c("admit"), summarize, +ddata <- ddply(mydata, c("admit"), summarize, gpa.over3 = length(gpa[gpa>=3]), gpa.over3.5 = length(gpa[gpa>=3.5]), gpa.over3per = length(gpa[gpa>=3])/length(gpa), @@ -277,7 +277,7 @@
Group-wise Operations/plyr/functions