---
title: "Association Rules"
author: "Bert Gollnick"
output: 
  html_document:
    toc: true
    toc_float: true
    toc_depth: 2
    code_folding: hide
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```
```{r}
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(readxl))
suppressPackageStartupMessages(library(tidyr))
suppressPackageStartupMessages(library(arules))
suppressPackageStartupMessages(library(arulesViz))
```
# Data Understanding
We work with a dataset on Online Retail. If you want to know more about the dataset, you can check it out [here](https://archive.ics.uci.edu/ml/datasets/online+retail).
Here is the description from the provider of the dataset:
"This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers."
It has the following attributes:
- InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c', it indicates a cancellation.
- StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
- Description: Product (item) name. Nominal.
- Quantity: The quantities of each product (item) per transaction. Numeric.
- InvoiceDate: Invoice date and time. Numeric, the day and time when each transaction was generated.
- UnitPrice: Unit price. Numeric, Product price per unit in sterling.
- CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
- Country: Country name. Nominal, the name of the country where each customer resides.
# Data Preparation
## Raw Data Import
```{r}
# if file does not exist, download it first
file_path <- "./data/OnlineRetail.xlsx"
if (!file.exists(file_path)) {
  dir.create("./data", showWarnings = FALSE)
  url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx"
  # mode = "wb" keeps the binary .xlsx file intact on Windows
  download.file(url = url, 
                destfile = file_path, 
                mode = "wb")
}

retail <- read_xlsx(path = file_path)
retail %>% head
```
## Filter for missing data
We remove all rows that contain missing values.
```{r}
retail_mod <- retail %>% 
  na.omit
```
## Data Types
Description is stored as character strings. We convert it to a factor and extract the hour of purchase from InvoiceDate.
```{r}
retail_mod <- retail_mod %>% 
  mutate(Description = as.factor(Description)) %>% 
  mutate(Hour = unclass(as.POSIXlt(InvoiceDate)$hour))
```
## Exploratory Data Analysis
We check the number of unique products.
```{r}
retail_mod$Description %>% 
  table %>% 
  length
```
The dataset covers a period from `r min(retail_mod$InvoiceDate)` to 
`r max(retail_mod$InvoiceDate)`.
### Count of Items per Purchase
Now we check how many line items each purchase (invoice) contains.
```{r}
nr_items_per_buy <- retail_mod %>% 
  group_by(InvoiceNo) %>% 
  summarise(items = n()) %>% 
  ungroup() %>% 
  group_by(items) %>% 
  summarise(count = n())

n_items_max <- 15
g <- nr_items_per_buy %>% 
  dplyr::filter(items <= n_items_max) %>% 
  ggplot(., aes(x = items, y = count))
g <- g + geom_col()
g <- g + scale_x_continuous(breaks = 1:n_items_max)
# labs() takes x and y (not xlab / ylab) for the axis titles
g <- g + labs(title = "Count and Items", 
              x = "Items bought", 
              y = "Nr. of Buys")
g <- g + theme_bw()
g
```
The distribution looks reasonable; its decreasing shape is reminiscent of [Benford's law](https://en.wikipedia.org/wiki/Benford%27s_law), which strictly speaking describes the frequency of leading digits.
### Time of Purchase
When are items usually bought?
```{r}
time_of_buy <- retail_mod %>% 
  group_by(Hour) %>% 
  summarise(count = n())

g <- ggplot(time_of_buy, aes(Hour, count))
g <- g + geom_col()
g <- g + scale_x_continuous(breaks = 6:20)
# labs() takes x and y (not xlab / ylab) for the axis titles
g <- g + labs(title = "Time of Purchase", 
              x = "Hour", 
              y = "Nr. of Items sold")
g <- g + theme_bw()
g
```
We see that orders are not recorded around the clock: they only occur between 6 AM and 9 PM.
Most sales take place around lunchtime.
### Best-selling Products
```{r}
bestsellers <- retail_mod %>% 
  group_by(Description) %>% 
  summarise(count = n()) %>% 
  ungroup() %>% 
  arrange(desc(count)) %>% 
  top_n(10, wt = count)  # rank explicitly by count

g <- bestsellers %>% 
  ggplot(., aes(x = reorder(Description, count), 
                y = count))
g <- g + geom_col()
# labs() takes x and y (not xlab / ylab) for the axis titles
g <- g + labs(title = "Best-selling Products", 
              x = "Product", 
              y = "Nr. of Items sold")
g <- g + theme_bw()
g <- g + theme(axis.text.x = element_text(angle = 90, hjust = 1))
g
```
## Data Reshaping
We need to reshape the data so that each transaction (invoice) becomes one set of items that the algorithm can process.
```{r}
# code here
```
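One possible shape for this step, sketched on made-up data: collapse all product descriptions of an invoice into a single comma-separated string, so that each row is one basket. The toy invoices and the name `toy_item_sets` are hypothetical; the only thing taken from this template is that the export chunk expects a `Description` column. Real descriptions may themselves contain commas, which would have to be cleaned first.

```r
# Toy stand-in for retail_mod: one row per purchased product (made-up values)
toy_retail <- data.frame(
  InvoiceNo   = c("536365", "536365", "536366", "536367", "536367", "536367"),
  Description = c("MUG", "LANTERN", "MUG", "MUG", "LANTERN", "CANDLE"),
  stringsAsFactors = FALSE
)

# One row per invoice; the items of an invoice are pasted into one string
toy_item_sets <- aggregate(Description ~ InvoiceNo,
                           data = toy_retail,
                           FUN = paste,
                           collapse = ",")
print(toy_item_sets)
```

Base R `aggregate()` is used here only to keep the sketch free of package dependencies; a `group_by()` / `summarise()` pipeline would do the same job.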
```{r}
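# item_sets must be created in the chunk above.
# write.csv() ignores col.names, so write.table() is used to omit the header row.
write.table(x = item_sets$Description, 
            file = "./data/apriori_list.csv", 
            sep = ",", 
            quote = FALSE, 
            row.names = FALSE, 
            col.names = FALSE)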
```
## Transformation to Transactions
Now we create a transactions-object based on this dataframe.
```{r}
# code here
```
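As a hedged illustration of this step, the sketch below writes a few made-up baskets to a temporary file and reads them back with `read.transactions()` from `arules` in "basket" format (one transaction per line, items separated by commas). The file and the item names are hypothetical; for the real data the same call would presumably point at `./data/apriori_list.csv`.

```r
suppressPackageStartupMessages(library(arules))

# Made-up baskets, one transaction per line (stand-in for apriori_list.csv)
toy_file <- tempfile(fileext = ".csv")
writeLines(c("MUG,LANTERN",
             "MUG,LANTERN,CANDLE",
             "MUG",
             "CANDLE,HOLDER",
             "CANDLE,HOLDER,MUG"),
           toy_file)

# Each line becomes one transaction; sep splits a line into items
toy_trans <- read.transactions(toy_file, format = "basket", sep = ",")
summary(toy_trans)
inspect(toy_trans)
```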
# Model
## Item Frequency
```{r}
# code here
```
```{r}
# code here
```
The graph shows the top 10 purchased items.
```{r}
# code here
```
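A minimal sketch of how item frequencies could be obtained, assuming a transactions object built from made-up baskets (`toy_trans` is a hypothetical name). `itemFrequency()` returns how often each item occurs and `itemFrequencyPlot()` draws the most frequent ones.

```r
suppressPackageStartupMessages(library(arules))

# Made-up baskets coerced to a transactions object
toy_trans <- as(list(
  c("MUG", "LANTERN"),
  c("MUG", "LANTERN", "CANDLE"),
  c("MUG"),
  c("CANDLE", "HOLDER"),
  c("CANDLE", "HOLDER", "MUG")
), "transactions")

# Absolute counts per item, most frequent first
item_counts <- sort(itemFrequency(toy_trans, type = "absolute"),
                    decreasing = TRUE)
head(item_counts, 10)

# Bar chart of the (at most) 10 most frequent items
itemFrequencyPlot(toy_trans, topN = 10, type = "absolute")
```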
## Cross Table
The cross table shows joint occurrences of items.
```{r}
# code here
```
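The idea can be sketched with `crossTable()` from `arules`, again on made-up baskets (`toy_trans` and `toy_cross` are hypothetical names). With `measure = "count"`, each off-diagonal cell holds the number of transactions that contain both items.

```r
suppressPackageStartupMessages(library(arules))

# Made-up baskets coerced to a transactions object
toy_trans <- as(list(
  c("MUG", "LANTERN"),
  c("MUG", "LANTERN", "CANDLE"),
  c("MUG"),
  c("CANDLE", "HOLDER"),
  c("CANDLE", "HOLDER", "MUG")
), "transactions")

# Item-by-item matrix of joint occurrences, most frequent items first
toy_cross <- crossTable(toy_trans, measure = "count", sort = TRUE)
print(toy_cross)

# Number of baskets that contain both a mug and a lantern
toy_cross["MUG", "LANTERN"]
```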
## Generate Rules
Calculate all rules.
```{r}
# code here
```
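A sketch of rule generation with `apriori()`, under assumed thresholds. Support is the share of transactions that contain all items of a rule, and confidence is the support of the whole rule divided by the support of its left-hand side. The values 0.4 and 0.6 and the toy baskets are arbitrary choices for illustration; the real data will likely need much lower support.

```r
suppressPackageStartupMessages(library(arules))

# Made-up baskets coerced to a transactions object
toy_trans <- as(list(
  c("MUG", "LANTERN"),
  c("MUG", "LANTERN", "CANDLE"),
  c("MUG"),
  c("CANDLE", "HOLDER"),
  c("CANDLE", "HOLDER", "MUG")
), "transactions")

# minlen = 2 drops rules with an empty left-hand side
toy_rules <- apriori(toy_trans,
                     parameter = list(support = 0.4,
                                      confidence = 0.6,
                                      minlen = 2),
                     control = list(verbose = FALSE))
inspect(toy_rules)
```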
Get the top rules, sorted by confidence.
```{r}
# code here
```
Get the top rules, sorted by lift.
```{r}
# code here
```
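Both rankings can be sketched with `sort()` and `head()` on a rules object. Lift is the confidence of a rule divided by the support of its right-hand side, so values above 1 indicate that the items occur together more often than expected under independence. The toy baskets, the thresholds, and the cut-off of three rules are assumptions made for illustration.

```r
suppressPackageStartupMessages(library(arules))

# Made-up baskets and rules with arbitrary thresholds
toy_trans <- as(list(
  c("MUG", "LANTERN"),
  c("MUG", "LANTERN", "CANDLE"),
  c("MUG"),
  c("CANDLE", "HOLDER"),
  c("CANDLE", "HOLDER", "MUG")
), "transactions")
toy_rules <- apriori(toy_trans,
                     parameter = list(support = 0.4,
                                      confidence = 0.6,
                                      minlen = 2),
                     control = list(verbose = FALSE))

# sort() orders rules by a quality measure, highest values first
top_by_confidence <- head(sort(toy_rules, by = "confidence"), n = 3)
top_by_lift       <- head(sort(toy_rules, by = "lift"), n = 3)

inspect(top_by_confidence)
inspect(top_by_lift)
```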
## Visualise Best Rules
```{r}
# code here
```
## Specific Rules for an Item
```{r}
# code here
```
```{r}
# code here
```
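One way to look at a single item, sketched under the same assumptions as before: `subset()` keeps only the rules whose right-hand side contains that item. The item name "CANDLE" and the toy baskets are made up; on the real data it would be replaced by a product description.

```r
suppressPackageStartupMessages(library(arules))

# Made-up baskets and rules with arbitrary thresholds
toy_trans <- as(list(
  c("MUG", "LANTERN"),
  c("MUG", "LANTERN", "CANDLE"),
  c("MUG"),
  c("CANDLE", "HOLDER"),
  c("CANDLE", "HOLDER", "MUG")
), "transactions")
toy_rules <- apriori(toy_trans,
                     parameter = list(support = 0.4,
                                      confidence = 0.6,
                                      minlen = 2),
                     control = list(verbose = FALSE))

# Keep only rules that predict the purchase of a candle
candle_rules <- subset(toy_rules, subset = rhs %in% "CANDLE")
inspect(candle_rules)
```

Swapping `rhs` for `lhs` would instead show what tends to be bought once the item is already in the basket.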
# Acknowledgement
We thank the author of this dataset:
Dr Daqing Chen, Director: Public Analytics group. chend '@' lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.