-
Notifications
You must be signed in to change notification settings - Fork 0
/
Forecasting Methods Assignment 3.qmd
309 lines (234 loc) · 10.5 KB
/
Forecasting Methods Assignment 3.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
---
title: "Forecasting Methods Assignment 3"
author: "Kowsalya Muralidharan"
format:
html:
fig-align: center # Center figure alignment
embed-resources: true # Force document to be self-contained (critical!)
code-fold: true # Turn on code fold
toc: true # Create table of contents
self-contained: TRUE
editor: visual
embed-resources: true
execute:
warning: false # Turn off warnings
---
#### SECTION - 1
> | Assess the behavior/data-generating process of your time-series using the techniques discussed in class. From your understanding of the process, would you expect the time-series to be mean stationary? Variance stationary? Why or why not?
> | If the data appear to be variance non-stationary (using a rolling SD or other method), transform your time-series using a natural log or Box-Cox transformation.
> | If seasonality is present in your data, use seasonal differencing to remove the seasonal effect (for example, for monthly data estimate the difference as ).
> | Conduct and interpret a KPSS test for stationarity after performing the above operations (if necessary). Do the results suggest that the process is mean stationary? If the process is mean non-stationary, calculate difference the data until mean stationary (after log/Box-Cox and seasonal differencing) and visualize it.
```{r}
#| warning: false
library(slider)
library(lubridate)
library(tsibble)
library(dplyr)
library(ggplot2)
library(kableExtra)
library(tidyr)
library(patchwork)
library(lubridate)
library(knitr)
library(fpp3)
library(psych)
#install.packages("urca")
library(forecast)
library(urca)
```
```{r}
accidents <- read.csv("C:\\Users\\kowsa\\Downloads\\cpd_crashes.csv")
accidents$date <- ymd(accidents$date)
accidents_tsbl<- accidents %>%
as_tsibble(index=date) %>%
mutate(date = ymd(accidents$date))
accidents_tsbl
start_date <- as.Date("2023-01-01")
end_date <- as.Date("2023-12-31")
# Filter data for the desired time period
accidents_new <- accidents_tsbl %>%
filter(date >= start_date & date <= end_date)
```
```{r}
#Rolling standard deviation
accidents_new %>%
mutate(
sd = slider::slide_dbl(cpd_crashes, sd,.before = 3, .after = 3, .complete = TRUE)
) %>%
ggplot()+
geom_line(aes(date,sd),linewidth=0.5,na.rm=TRUE)+
ggtitle('Rolling Standard Deviation of Cincinnati Crashes',
subtitle = 'Long-term Variance Fairly Constant with some fluctuations')+ theme_bw() +
ylab('Weekly Rolling Standard Deviation')
```
- As there are many peaks and valleys, we can perform transformations to reduce the variance in the data.
- Log transformation has been performed as shown below, and we can see the density curve showing satisfactory results. Hence, we can proceed with the log-transformed value of the number of crashes.
```{r}
#Performing transformations to reduce variance
density = accidents_tsbl %>%
ggplot()+
geom_density(aes(cpd_crashes))+
geom_vline(aes(xintercept=mean(cpd_crashes)),linetype='dashed',color='red')
line = accidents_tsbl %>%
ggplot()+
geom_line(aes(date,cpd_crashes))+
geom_smooth(aes(date,cpd_crashes),method='lm')
log_density = accidents_tsbl %>%
ggplot()+
geom_density(aes(log(cpd_crashes)))+
geom_vline(aes(xintercept=mean(log(cpd_crashes))),linetype='dashed',color='red')
log_line = accidents_tsbl %>%
ggplot()+
geom_line(aes(date,log(cpd_crashes)))+
geom_smooth(aes(date,log(cpd_crashes)),method='lm')
density + line + log_density + log_line+ plot_layout(ncol = 2, byrow = TRUE) + plot_annotation(title = "Comparison of Density and Line Plots before/after Log Transformation")
accidents_tsbl <- accidents_tsbl %>%
mutate(accidents_log = log(cpd_crashes))
accidents_tsbl
```
Now let's check if it is mean stationary.
```{r}
accidents_tsbl_roll <- accidents_tsbl %>%
mutate(
accidents_mean = slide_dbl(
cpd_crashes,
mean,
.before=6,
.after=6,
.complete=T),
accidents_sd = slide_dbl(
cpd_crashes,
sd,
.before=6,
.after=6)
)
accidents_tsbl_roll_mean <- accidents_tsbl_roll %>%
ggplot() +
geom_line(aes(date, cpd_crashes)) +
geom_line(aes(date, accidents_mean),color='blue') +
theme_bw() +
ggtitle("Cincinnati Crashes over Time (12 month rolling window)") +
ylab("Accidents") +
xlab("Date")
accidents_tsbl_roll_mean
```
It seems to be mean stationary. Let's conduct the KPSS Test for stationarity.
```{r}
accidents_diff <- accidents_tsbl %>%
mutate(
accidents_log = log1p(cpd_crashes),
cpd_diff = cpd_crashes - lag(cpd_crashes),
accidents_log_diff = accidents_log - lag(accidents_log)) %>%
drop_na() %>%
as_tsibble(index=date)
# Raw close value - Non-stationary
raw_value_kpss = accidents_diff %>%
features(cpd_crashes, unitroot_kpss)
# First difference - **Non-stationary**
diff_value_kpss = accidents_diff %>%
features(cpd_diff, unitroot_kpss)
# Log close value - Non-stationary
log_value_kpss = accidents_diff %>%
features(accidents_log, unitroot_kpss)
# Differenced log close value - Stationary
log_diff_kpss = accidents_diff %>%
features(accidents_log_diff, unitroot_kpss)
log_value_kpss
```
Since the p-value in the KPSS test is \> 0.05, we can conclude that the time series is mean stationary. This means we do not need to perform any additional steps to induce mean stationarity in the series.
```{r}
accidents_tsbl %>%
gg_tsdisplay(difference(log(cpd_crashes),12),
plot_type='partial', lag=24) +
labs(title="Seasonally differenced (log)", y="")
```
Even though there is a significant spike, we cannot conclude there is significant seasonality in our time-series.
#### SECTION 2
| Produce and interpret ACF/PACF plots of the transformed time series after the above operations. Based on the ACF/PACF alone, does the time-series appear to be an autoregressive process, moving average process, combined, or neither? What might you suspect the order of the time-series to be (e.g. ARIMA(0,1,2)) using the ACF/PACF plots and stationarity tests? If the data contain a seasonal component, describe any seasonal ARIMA effects (PDQ) that may be interpretable from the ACF/PACF plots.
```{r}
accidents_tsbl %>%
gg_tsdisplay(accidents_log, plot_type = 'partial', lag = 24) +
labs(title = "CPD crashes", y = "")
```
- As we can see dampening effect in the ACF plot, we can conclude that the log-transformed time series is an auto-regressive process.
- The PACF plot shows the first lag period captures the most significant portion of the auto-regressive property, so it could be AR(1) process.
- Since there was no differencing applied above, we propose that a model of ARIMA(1,0,0) would be reasonable for this time series.There is no significant seasonality in our time series as concluded earlier.
#### SECTION 3
| Fit several ARIMA models to your time-series variable based on the "best guesses" above. Which model is the "best" according to the AIC or BIC? After comparing several models, implement the automated ARIMA function from fable to find the "best" model. Does the automated ARIMA function select the same model as you did? If not, why do you think this is the case?
| After selecting the best model, derive the fitted values for the time series and plot them against the observed values of the series. Do the in-sample predicted values tend to follow the trends in the data?
```{r}
# Automatic Differencing using fable
accidents_tsbl %>%
model(
ARIMA(log(cpd_crashes)~pdq(1,0,0))
) %>%
report()
```
Now let's build and compare few models.
```{r}
models_bic = accidents_tsbl %>%
model(
mod1 = ARIMA(log(cpd_crashes)~pdq(0,1,0)+ PDQ(0,0,0)),
mod2 = ARIMA(log(cpd_crashes)~pdq(0,1,1)+ PDQ(0,0,0)),
mod3 = ARIMA(log(cpd_crashes)~pdq(1,1,0)+ PDQ(0,0,0)),
mod4 = ARIMA(log(cpd_crashes)~pdq(1,0,0)+ PDQ(0,0,0))
)
models_bic %>%
glance() %>%
arrange(BIC)
```
Based on the BIC, Model 4 (1,0,0)(0,0,0) is the best. Now let's run ARIMA function to see if it matches our expectation.
```{r}
best_mod<- accidents_tsbl %>%
model(
ARIMA(log(cpd_crashes) ~ PDQ(0,0,0),approximation=F)
) %>%
report()
```
ARIMA has selected (1,0,0) as the best model, which is in sync with Model 4 which was selected on the basis of least BIC.
```{r}
#best_model<- ARIMA(log(cpd_crashes)~pdq(1,0,0)+ PDQ(1,0,0))
# Assuming 'cpd_crashes' is your time series data
# Fit ARIMA model
best_model <- auto.arima(log(accidents_tsbl$cpd_crashes))
# Obtain fitted values
fitted <- fitted(best_model)
ggplot() +
geom_line(aes(accidents_tsbl$date, accidents_tsbl$accidents_log)) +
geom_line(aes(accidents_tsbl$date, fitted),color="blue") +
theme_bw() +
xlab("Date") +
ylab("Number of Crashes")+
ggtitle("Fitted values vs. Observed values")
```
The in-sample predicted values tend to follow the trends in the data.
#### SECTION 4
| Compute and analyze the residuals from the selected model. Conduct a Box-Ljung test for residual autocorrelation and examine the ACF/PACF plots in the residuals. Do the residuals appear to be white noise? If the residuals do not appear to be white noise, either interpret your results (or lack thereof) or suggest a possible solution to the problem.
| Finally, create a 6 time-period forecast for your model from the end of the data. Do the forecasted values seem reasonable? Why or why not?
```{r}
best_mod %>%
gg_tsresiduals()
```
- From the above plot, we can see that the residuals are normally distributed.
- The model is well-fitted as we do not see any significant autocorrelation at lag 1 in the ACF plot.
- There does not appear to be any seasonality in the ACF residual plot.
Now, let's perform the Ljung-Box Test to check for any leftover autocorrelation in the residuals.
```{r}
best_mod %>%
augment() %>%
features(.innov, ljung_box, lag = 10, dof = 1)
```
The p-value is \> 0.05, indicating that there is not residual auto-correlation and that the residuals are white noise.
```{r}
accidents_tsbl %>%
model(
ARIMA(log(cpd_crashes),approximation=F) # Didn't set stepwise here because of size of the data
) %>%
forecast(h=6) %>%
autoplot(
accidents_tsbl %>%
filter(date>=ymd('2023-12-01'))
)+
theme_bw() + ggtitle("6 day forecast of CPD crashes")
```
The forecast values does seem reasonable, given the beginning of new year 2024, there may be more traffic and visitors in Cincinnati, leading to more number of potential crashes.