---
title: "Forecasting Methods Assignment 4"
author: "Kowsalya Muralidharan"
format:
  html:
    fig-align: center     # Center figure alignment
    embed-resources: true # Force document to be self-contained (critical!)
    code-fold: true       # Turn on code folding
    theme: "pulse"
editor: visual
execute:
  warning: false # Turn off warnings
---
## **Section 1 - Train and Test Splits**

Approximately 80% of the dataset is used for training and the remaining 20% for testing the model. The plot below shows that the test set is a good representation of the training set.
```{r}
#| warning: false
# Load libraries for data wrangling, time-series modeling, and plotting
library(slider)
library(lubridate)
library(tsibble)
library(dplyr)
library(ggplot2)
library(kableExtra)
library(tidyr)
library(patchwork)
library(knitr)
library(fpp3)
library(psych)
library(forecast)
library(urca)
library(data.table)
```
```{r}
# Read the crash data and convert the date column to Date class
accidents <- read.csv("C:\\Users\\kowsa\\Downloads\\cpd_crashes.csv")
accidents$date <- ymd(accidents$date)

accidents_tsbl <- accidents %>%
  as_tsibble(index = date)

accidents_tsbl
```
```{r}
# Split at the date 80% of the way through the series
split_date <- as.Date("2023-01-01") + floor(0.8 * nrow(accidents_tsbl))

train_data <- accidents_tsbl %>% filter(date < split_date)
test_data  <- accidents_tsbl %>% filter(date >= split_date)

ggplot(train_data, aes(x = date, y = cpd_crashes)) +
  geom_line(color = "blue", linewidth = 0.5) +
  geom_line(data = test_data, color = "red") +
  geom_vline(xintercept = split_date, color = "darkgreen", linetype = "dashed") +
  theme_bw() +
  ylab("Number of Crashes") +
  ggtitle("Training and Test Data")
```
## **Section 2 - Cross-Validation Scheme**

Next, we set up a stretching (expanding) window cross-validation scheme using `stretch_tsibble()`. Both the initial training period and the step size are set to 30 days, so each fold adds roughly one month of data.
```{r}
# Expanding-window CV: the first fold holds 30 days; each subsequent fold adds 30 more
accidents_cv <- accidents_tsbl %>%
  stretch_tsibble(.init = 30, .step = 30)
accidents_cv
```
```{r}
accidents_cv %>%
  ggplot() +
  geom_point(aes(date, factor(.id), color = factor(.id))) +
  ylab('Iteration') +
  ggtitle('Samples included in each CV Iteration')
```
Based on our initial and step values, we have 12 iterations in our cross-validation scheme.
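As a quick check, the number of folds can be counted directly from the `.id` column that `stretch_tsibble()` creates:

```{r}
# Each distinct .id corresponds to one CV fold
n_distinct(accidents_cv$.id)
```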
## **Section 3 - Model Selection and Comparison**

To compare models, let's fit an ARIMA model, a naive model, and a naive model with drift to each fold in the cross-validation scheme, then visualize and validate which model performs best.
```{r}
# Fit candidate models to each fold; the result is a model table (mable)
accidents_cv %>%
  model(
    arima = ARIMA(cpd_crashes ~ pdq(1,0,0) + PDQ(0,0,0)),
    naive = NAIVE(cpd_crashes)
  )
```
```{r}
# Fit all three candidates to each fold and forecast 14 days ahead
accidents_cv_forecast <- accidents_cv %>%
  model(
    naive = NAIVE(cpd_crashes),
    arima = ARIMA(cpd_crashes ~ pdq(1,0,0) + PDQ(0,0,0)),
    naive_w_drift = NAIVE(cpd_crashes ~ drift())
  ) %>%
  forecast(h = 14)

accidents_cv_forecast %>%
  autoplot(accidents_cv) +
  facet_wrap(~.id, nrow = 4, scales = 'free_y') +
  theme_bw() +
  ylab('Number of Crashes') +
  ggtitle("Cross validation forecast comparison for each iteration") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
```{r}
accidents_cv_forecast %>%
  as_tibble() %>%
  select(-cpd_crashes) %>%
  left_join(accidents_tsbl, by = "date") %>%
  ggplot() +
  geom_line(aes(date, cpd_crashes)) +
  geom_line(aes(date, .mean, color = factor(.id), linetype = .model)) +
  scale_color_discrete(name = 'Iteration') +
  ylab('CPD Crashes') +
  ggtitle("Cross validation forecast comparison for each model")
```
When we combine the 12 iterations of each model, we can see that NAIVE performs poorly while ARIMA performs best among the three models, following the underlying trend of the training data most closely.

To further compare the performance of the three models, we will use the RMSE, MAE, MAPE, and MASE.
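For reference, with actual values $y_t$, forecasts $\hat{y}_t$, $n$ forecast points, and $T$ training observations, these metrics are defined as:

$$
\text{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right)^2}, \qquad
\text{MAE} = \frac{1}{n}\sum_{t=1}^{n}\left|y_t - \hat{y}_t\right|
$$

$$
\text{MAPE} = \frac{100}{n}\sum_{t=1}^{n}\left|\frac{y_t - \hat{y}_t}{y_t}\right|, \qquad
\text{MASE} = \frac{\text{MAE}}{\frac{1}{T-1}\sum_{t=2}^{T}\left|y_t - y_{t-1}\right|}
$$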
```{r}
accidents_cv_forecast %>%
  as_tibble() %>%
  select(-cpd_crashes) %>%
  left_join(accidents_tsbl, by = "date") %>%
  group_by(.id, .model) %>%
  mutate(days_ahead = row_number()) %>% # horizon of each forecast (1-14 days)
  ungroup() %>%
  mutate(error = abs(cpd_crashes - .mean)) %>%
  ggplot() +
  geom_boxplot(aes(factor(days_ahead), error)) +
  geom_point(aes(factor(days_ahead), error, color = factor(.id)), alpha = 0.4) +
  facet_wrap(~.model, ncol = 1) +
  guides(color = 'none') +
  ylab('Absolute Error') +
  xlab('Days Ahead') +
  ggtitle('Absolute Error by Iteration: ARIMA, NAIVE & NAIVE with Drift')
```
From the above plot, once again we see that the ARIMA model performs best based on the absolute error values.
```{r}
# RMSE by forecast horizon
rmse <- accidents_cv_forecast %>%
  as_tibble() %>%
  select(-cpd_crashes) %>%
  left_join(accidents_tsbl, by = "date") %>%
  group_by(.id, .model) %>%
  mutate(days_ahead = row_number()) %>%
  ungroup() %>%
  filter(!is.na(cpd_crashes)) %>%
  group_by(days_ahead, .model) %>%
  summarize(rmse = sqrt(mean((cpd_crashes - .mean)^2, na.rm = TRUE)),
            .groups = "drop") %>%
  ggplot() +
  geom_line(aes(days_ahead, rmse, color = .model)) +
  xlab("Days Ahead") +
  ylab("RMSE")

# MAPE by forecast horizon (summarize, not mutate, so there is one value per horizon)
mape <- accidents_cv_forecast %>%
  as_tibble() %>%
  select(-cpd_crashes) %>%
  left_join(accidents_tsbl, by = "date") %>%
  group_by(.id, .model) %>%
  mutate(days_ahead = row_number()) %>%
  ungroup() %>%
  group_by(days_ahead, .model) %>%
  summarize(mape = mean(abs((cpd_crashes - .mean) / cpd_crashes), na.rm = TRUE),
            .groups = "drop") %>%
  ggplot() +
  geom_line(aes(days_ahead, mape, color = .model)) +
  xlab("Days Ahead") +
  ylab("MAPE")

# MAE by forecast horizon
mae <- accidents_cv_forecast %>%
  as_tibble() %>%
  select(-cpd_crashes) %>%
  left_join(accidents_tsbl, by = "date") %>%
  group_by(.id, .model) %>%
  mutate(days_ahead = row_number()) %>%
  ungroup() %>%
  group_by(days_ahead, .model) %>%
  summarize(mae = mean(abs(cpd_crashes - .mean), na.rm = TRUE),
            .groups = "drop") %>%
  ggplot() +
  geom_line(aes(days_ahead, mae, color = .model)) +
  xlab("Days Ahead") +
  ylab("MAE")

rmse + mape + mae +
  plot_layout(ncol = 2, byrow = TRUE) +
  plot_annotation(title = "Comparison of RMSE, MAPE and MAE plots")
```
- From the above plots, we can observe that the naive and naive-with-drift models perform close to each other.
- Although the naive models show more stability over time, ARIMA clearly has the lowest error and hence the best performance.

Based on the full error metrics, the RMSE of our ARIMA model is 20.50, whereas those of NAIVE and NAIVE with drift are 32.43 and 33.25, respectively.
```{r}
accidents_cv_forecast %>%
  accuracy(accidents_tsbl) %>%
  data.table()
```
Here, we compare the ME, RMSE, MAE, MPE, MAPE, and MASE values for each model across the first four iterations of our validation scheme.
```{r}
# Per-iteration accuracy for the first four CV folds
accidents_cv_forecast %>%
  filter(.id < 5) %>%
  group_by(.id) %>%
  accuracy(accidents_tsbl) %>%
  ungroup() %>%
  arrange(.id)
```
## **Section 4 - Final Forecast**

Now that we have concluded that the ARIMA model is the best, let's refit it to the entire training set and produce a forecast over the test set. The goal is to compare the forecast against the out-of-sample observed values.
```{r}
# Refit the chosen ARIMA(1,0,0) to the full training set
best_model1 <- train_data %>%
  model(ARIMA(cpd_crashes ~ pdq(1,0,0) + PDQ(0,0,0)))

# Forecast the 73-day test period and overlay the full series,
# marking the train/test boundary at split_date
best_model1 %>%
  forecast(h = 73) %>%
  autoplot(accidents_tsbl) +
  geom_vline(xintercept = split_date, color = "red", linetype = "dashed") +
  ggtitle("Forecast vs Actual of CPD Crashes",
          subtitle = 'ARIMA Forecast for Non-Seasonal Data')
```
```{r}
best_model1 %>%
  forecast(h = 73) %>%
  accuracy(test_data)
```
- The above plot shows the final forecast on the test data based on the best model (*ARIMA(1,0,0)(0,0,0)*) selected from the cross-validation analysis.
- The forecast quickly flattens into a straight line (see the note below). The model does not appear to produce a reasonable forecast and is under-fitting our data.
- The RMSE of our final model is 25.76, which is slightly higher than ideal.
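The flat forecast is in fact characteristic of a stationary AR(1) model: its $h$-step-ahead point forecast is

$$
\hat{y}_{T+h \mid T} = \mu + \phi^{h}\left(y_T - \mu\right),
$$

which decays geometrically toward the series mean $\mu$ since $|\phi| < 1$. Over a 73-day horizon the forecast is effectively the training-set mean, so it cannot track the day-to-day variation in the test period, consistent with the under-fitting noted above.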