forked from rdpeng/RepData_PeerAssessment1
# Reproducible Research: Peer Assessment 1
### GitHub source code: https://github.com/iyermobile/RepData_PeerAssessment1
Set up `knitr` options.
```{r setoptions}
# Qualify with knitr:: so the call works even when knitr is not attached
knitr::opts_chunk$set(message = FALSE, fig.height = 7, fig.width = 7)
```
Load needed libraries.
```{r}
# library() stops with an error if a package is missing; require() only warns
library(plyr)
library(ggplot2)
```
## Loading and preprocessing the data
```{r}
unzip("activity.zip")
activity <- read.csv("activity.csv", header = TRUE)
activity <- transform(activity, date = as.Date(date))
```
## What is mean total number of steps taken per day?
Find the total number of steps taken per day.
```{r}
stepsPerDay <- ddply(activity, ~date, summarise, steps = sum(steps))
```
Plot a histogram of the total number of steps taken per day.
```{r histogram-total-steps}
p <- ggplot(stepsPerDay, aes(steps))
p <- p + geom_histogram(fill = "white", color = "red")
p <- p + ggtitle("Total number of Steps per day")
p + xlab("Steps per day")
```
Compute the `mean` and `median` total number of steps taken per day.
```{r}
meanStepsPerDay <- mean(stepsPerDay$steps, na.rm = TRUE)
medianStepsPerDay <- median(stepsPerDay$steps, na.rm = TRUE)
```
- The mean total number of steps per day is `r format(meanStepsPerDay)`.
- The median total number of steps per day is `r format(medianStepsPerDay)`.
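The daily totals above are computed with `sum(steps)` and the `NA`s are dropped only later, in `mean()` and `median()`. A minimal sketch of why the order matters, using a made-up vector (`dayWithGaps` is a hypothetical name, not part of the analysis): dropping `NA`s inside `sum()` would turn a day with no recorded steps into a total of 0 instead of leaving it missing.

```r
# Hypothetical day in which every interval is missing (illustrative data only)
dayWithGaps <- c(NA_integer_, NA_integer_, NA_integer_)

# Keeping the NAs marks the whole day as missing
totalKeepNA <- sum(dayWithGaps)

# Dropping the NAs inside sum() reports the day as zero steps
totalDropNA <- sum(dayWithGaps, na.rm = TRUE)

is.na(totalKeepNA)  # TRUE
totalDropNA         # 0
```

A total of 0 would be counted as a real observation and would pull the mean and median down, which is presumably why the missing days are excluded at the summary step instead.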
## What is the average daily activity pattern?
Find the average number of steps taken per 5-minute interval.
```{r}
avgStepsPerInterval <- ddply(activity,
                             ~interval,
                             summarise,
                             mean = mean(steps, na.rm = TRUE))
```
Make a time series plot of the average number of steps taken in each 5-minute interval, averaged across all days.
```{r plot-average-steps}
p <- ggplot(avgStepsPerInterval, aes(interval, mean)) + geom_line()
p <- p + ggtitle("The average daily activity pattern")
p + xlab("Interval") + ylab("Number of Steps")
```
Find the 5-minute interval that contains the maximum number of steps on average across all the days in the dataset.
```{r}
maxId <- which.max(avgStepsPerInterval$mean)
maxInterval <- avgStepsPerInterval$interval[maxId]
```
- The 5-minute interval with the highest average number of steps across all days is `r maxInterval`.
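The interval identifiers in this dataset appear to encode the time of day as hours and minutes (for example, 835 for 08:35), so the reported interval may be easier to read as a clock time. A minimal sketch under that assumption; `intervalToClock` is a hypothetical helper, not part of the original analysis:

```r
# Hypothetical helper: assumes interval identifiers are coded as HHMM
# (0, 5, ..., 55, 100, 105, ..., 2355)
intervalToClock <- function(interval) {
  hours <- interval %/% 100    # integer division gives the hour
  minutes <- interval %% 100   # remainder gives the minutes
  sprintf("%02d:%02d", hours, minutes)
}

clockLabels <- intervalToClock(c(0, 55, 100, 835, 2355))
clockLabels
# "00:00" "00:55" "01:00" "08:35" "23:55"
```

With such a helper, `intervalToClock(maxInterval)` would report the peak interval as a time of day.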
## Imputing missing values
Calculate the total number of rows with missing values in the dataset.
```{r}
# Count rows that contain at least one NA in any column
numberRowNAs <- sum(!complete.cases(activity))
```
- The total number of rows with missing values in the dataset is `r numberRowNAs`.
Create a function that replaces each missing `steps` value with the mean of its 5-minute interval, averaged across all days.
```{r}
na.replace <- function(act) {
  # Process each interval separately and fill its NAs with that interval's mean
  ddply(act, ~interval, function(dd) {
    steps <- dd$steps
    dd$steps[is.na(steps)] <- mean(steps, na.rm = TRUE)
    return(dd)
  })
}
```
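To illustrate the per-interval imputation on a small scale, here is a sketch on made-up data that uses base R's `ave()` in place of `ddply()`. The `toy` data frame and its values are assumptions for illustration only, and this is an alternative formulation, not the function used in this analysis.

```r
# Made-up data: two intervals observed on three days, with two gaps
toy <- data.frame(
  interval = c(0, 5, 0, 5, 0, 5),
  steps    = c(10, NA, NA, 4, 20, 8)
)

# ave() returns, for every row, the mean of that row's interval group
intervalMean <- ave(toy$steps, toy$interval,
                    FUN = function(x) mean(x, na.rm = TRUE))

# Replace only the missing entries with their interval mean
toyImputed <- toy
isMissing <- is.na(toy$steps)
toyImputed$steps[isMissing] <- intervalMean[isMissing]

toyImputed$steps
# 10  6 15  4 20  8
```

Unlike `ddply()`, which returns the rows grouped by interval, this variant keeps the original row order.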
Create a new dataset that is equal to the original dataset but with the missing data filled in.
```{r}
imputedActivity <- na.replace(activity)
```
Find the total number of steps taken each day.
```{r}
imputedStepsPerDay <- ddply(imputedActivity, ~date, summarise, steps = sum(steps))
```
Make a histogram of the total number of steps taken each day.
```{r histogram-imputed-total-steps}
p <- ggplot(imputedStepsPerDay, aes(steps))
p <- p + geom_histogram(fill = "white", color = "red")
p <- p + ggtitle("Total number of Steps per day (imputed data)")
p + xlab("Steps per day")
```
Calculate the `mean` and `median` total number of steps taken per day.
```{r}
imputedMeanStepsPerDay <- mean(imputedStepsPerDay$steps)
imputedMedianStepsPerDay <- median(imputedStepsPerDay$steps)
```
- The mean total number of steps per day is `r format(imputedMeanStepsPerDay)`.
- The median total number of steps per day is `r format(imputedMedianStepsPerDay)`.
The imputation slightly changed the median total number of steps taken per day, from `r format(medianStepsPerDay)` to `r format(imputedMedianStepsPerDay)`, while the mean remained the same. Imputing missing values can in general introduce bias into estimates, but in this case its impact on the estimates of the total daily number of steps is negligible.
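One way the mean can stay unchanged is when the missing values cover whole days: every such day then receives the sum of the interval means, which equals the mean daily total of the observed days. The sketch below checks this on made-up data. The `toyDays` data frame and its values are assumptions for illustration, not taken from the activity dataset.

```r
# Made-up data: three days, two intervals each; day "d3" is entirely missing
toyDays <- data.frame(
  date     = rep(c("d1", "d2", "d3"), each = 2),
  interval = rep(c(0, 5), times = 3),
  steps    = c(10, 4, 20, 8, NA, NA)
)

# Mean daily total before imputation (d3 sums to NA and is dropped)
dailyBefore <- tapply(toyDays$steps, toyDays$date, sum)
meanBefore <- mean(dailyBefore, na.rm = TRUE)

# Fill each NA with the mean of its interval across the observed days
perIntervalMean <- ave(toyDays$steps, toyDays$interval,
                       FUN = function(x) mean(x, na.rm = TRUE))
filledSteps <- ifelse(is.na(toyDays$steps), perIntervalMean, toyDays$steps)

# Mean daily total after imputation
dailyAfter <- tapply(filledSteps, toyDays$date, sum)
meanAfter <- mean(dailyAfter)

c(before = meanBefore, after = meanAfter)
# before  after
#     21     21
```

The imputed day gets a total of 15 + 6 = 21, exactly the mean of the two observed totals (14 and 28), so adding it leaves the mean where it was. The median has no such guarantee and can shift.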
## Are there differences in activity patterns between weekdays and weekends?
Create a new factor variable `weekpart` in the dataset with two levels, `Weekday` and `Weekend`.
```{r}
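weekParts <- c("Weekday", "Weekend")
date2weekpart <- function(date) {
  # wday runs from 0 (Sunday) to 6 (Saturday) regardless of the session
  # locale, unlike weekdays(), which returns localized day names
  isWeekend <- as.POSIXlt(date)$wday %in% c(0, 6)
  factor(ifelse(isWeekend, "Weekend", "Weekday"), levels = weekParts)
}
# The function is vectorized, so it can be applied to the whole column at once
imputedActivity$weekpart <- date2weekpart(imputedActivity$date)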
```
Make a panel plot containing time series plots of the average number of steps taken in each 5-minute interval, averaged separately across weekdays and weekend days.
```{r plot-weekdays-weekends}
avgSteps <- ddply(imputedActivity,
                  .(interval, weekpart),
                  summarise,
                  mean = mean(steps))
p <- ggplot(avgSteps, aes(x = interval, y = mean))
p <- p + geom_line() + facet_grid(. ~ weekpart)
p <- p + ggtitle("Activity patterns on Weekdays and Weekends")
p + xlab("Interval") + ylab("Number of Steps")
```