-
Notifications
You must be signed in to change notification settings - Fork 14
/
098-leveraging-database-views.Rmd
316 lines (235 loc) · 14.4 KB
/
098-leveraging-database-views.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
# Leveraging Database Views {#chapter_leveraging-database-views}
> This chapter demonstrates how to:
>
> * Understand database views and their uses
> * Unpack a database view to see what it's doing
> * Reproduce a database view with dplyr code
> * Write an alternative to a view that provides more details
> * Create a database view either for personal use or for submittal to your enterprise DBA
## Setup our standard working environment
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(digits = 10)
sleep_default <- 1
```
Use these libraries:
```{r libraries, message=FALSE}
library(tidyverse)
library(DBI)
library(RPostgres)
library(connections)
library(glue)
require(knitr)
library(dbplyr)
library(sqlpetr)
library(bookdown)
library(lubridate)
library(gt)
```
Connect to `adventureworks`:
```{r Start adventureworks}
sp_docker_start("adventureworks")
Sys.sleep(sleep_default)
```
```{r Connect to adventureworks, warning=FALSE, message=FALSE}
# con <- connection_open( # use in an interactive session
con <- dbConnect( # use in other settings
RPostgres::Postgres(),
# without the previous and next lines, some functions fail with bigint data
# so change int64 to integer
bigint = "integer",
host = "localhost",
port = 5432,
user = "postgres",
password = "postgres",
dbname = "adventureworks"
)
dbExecute(con, "set search_path to sales;") # so that `dbListFields()` works
```
## The role of database `views`
A database `view` is an SQL query that is stored in the database. Most `views` are used for data retrieval, since they usually denormalize the tables involved. Because they are standardized and well-understood, they can save you a lot of work and document a query that can serve as a model to build on.
### Why database `views` are useful
Database `views` are useful for many reasons.
* **Authoritative**: database `views` are typically written by the business application vendor or DBA, so they contain authoritative knowledge about the structure and intended use of the database.
* **Performance**: `views` are designed to gather data in an efficient way, using all the indexes in an efficient sequence and doing as much work on the database server as possible.
* **Abstraction**: `views` are abstractions or simplifications of complex queries that provide customary (useful) aggregations. Common examples would be monthly totals or aggregation of activity tied to one individual.
* **Reuse**: a `view` puts commonly used code in one place where it can be used for many purposes by many people. If there is a change or a problem found in a `view`, it only needs to be fixed in one place, rather than having to change many places downstream.
* **Security**: a view can give selective access to someone who does not have access to underlying tables or columns.
* **Provenance**: `views` standardize data provenance. For example, the `AdventureWorks` database all of them are named in a consistent way that suggests the underlying tables that they query. And they all start with a **v**.
The bottom line is that `views` can save you a lot of work.
### Rely on -- **and** be critical of -- `views`
Because they represent a commonly used view of the database, it might seem like a `view` is always right. Even though they are conventional and authorized, they may still need verification or auditing, especially when used for a purpose other than the original intent. They can guide you toward what you need from the database but they could also mislead because they are easy to use and available. People may forget why a specific view exists and who is using it. Therefore any given view might be a forgotten vestige. part of a production data pipeline or a priceless nugget of insight. Who knows? Consider the `view`'s owner, schema, whether it's a materialized index view or not, if it has a trigger and what the likely intention was behind the view.
## Unpacking the elements of a `view` in the Tidyverse
Since a view is in some ways just like an ordinary table, we can use familiar tools in the same way as they are used on a database table. For example, the simplest way of getting a list of columns in a `view` is the same as it is for a regular table:
```{r}
dbListFields(con, "vsalespersonsalesbyfiscalyearsdata")
```
### Use a `view` just like any other table
From a retrieval perspective a database `view` is just like any other table. Using a view to retrieve data from the database will be completely standard across all flavors of SQL.
```{r Use a view just like a regular table}
v_salesperson_sales_by_fiscal_years_data <-
tbl(con, in_schema("sales","vsalespersonsalesbyfiscalyearsdata")) %>%
collect()
str(v_salesperson_sales_by_fiscal_years_data)
```
As we will see, our sample `view`, `vsalespersonsalesbyfiscalyearsdata` joins 5 different tables. We can assume that subsetting or calculation on any of the columns in the component tables will happen behind the scenes, on the database side, and done correctly. For example, the following query filters on a column that exists in only one of the `view`'s component tables.
```{r Subset using the information in one constituent table}
tbl(con, in_schema("sales","vsalespersonsalesbyfiscalyearsdata")) %>%
count(salesterritory, fiscalyear) %>%
collect() %>% # ---- pull data here ---- #
pivot_wider(names_from = fiscalyear, values_from = n, names_prefix = "FY_")
```
Although finding out what a view does behind the scenes requires that you use functions that are **not** standard, doing so has several general purposes:
* It is satisfying to know what's going on behind the scenes.
* Specific elements or components of a `view` might be worth plagiarizing or incorporating in our queries.
* It is necessary to understand the mechanics of a `view` if we are going to build on what it does or intend to extend or modify it.
### SQL source code
Functions for inspecting a view itself are not part of the ANSI standard, so they will be [database-specific](https://www.postgresql.org/docs/9.5/functions-info.html). Here is the code to retrieve a PostgreSQL view (using the `pg_get_viewdef` function):
```{r}
view_definition <- dbGetQuery(con, "select
pg_get_viewdef('sales.vsalespersonsalesbyfiscalyearsdata',
true)")
```
The PostgreSQL `pg_get_viewdef` function returns a data frame with one column named `pg_get_viewdef` and one row. To properly view its contents, the `\n` character strings need to be turned into new-lines.
```{r display the view source}
cat(unlist(view_definition$pg_get_viewdef))
```
Even if you don't intend to become completely fluent in SQL, it's useful to study as much of it as possible. Studying the SQL in a view is particularly useful to:
* Test your understanding of the database structure, elements, and usage
* Extend what's already been done to extract useful data from the database
### The ERD as context for SQL code
A database Entity Relationship Diagram (ERD) is very helpful in making sense of the SQL in a `view`. The ERD for `AdventureWorks` is [here](https://i.stack.imgur.com/LMu4W.gif). If a published ERD is not available, a tool like the PostgreSQL *pg_modeler* is capable of generating an ERD (or at least describing the portion of the database that is visible to you).
### Selecting relevant tables and columns
Before bginning to write code, it can be helpful to actually mark up the ERD to identify the specific tables that are involved in the view you are going to reproduce.
![](screenshots/AW-2008-OLTP-ERD.gif)
Define each table that is involved and identify the columns that will be needed from that table. The `sales.vsalespersonsalesbyfiscalyearsdata` view joins data from five different tables:
1. sales_order_header
2. sales_territory
3. sales_person
4. employee
5. person
For each of the tables in the `view`, we select the columns that appear in the `sales.vsalespersonsalesbyfiscalyearsdata`. Selecting columns in this way prevents joins that `dbplyr` would make automatically based on common column names, such as `rowguid` and `ModifiedDate` columns, which appear in almost all `AdventureWorks` tables. In the following code we follow the convention that any column that we change or create on the fly uses a snake case naming convention.
```{r}
sales_order_header <- tbl(con, in_schema("sales", "salesorderheader")) %>%
select(orderdate, salespersonid, subtotal)
sales_territory <- tbl(con, in_schema("sales", "salesterritory")) %>%
select(territoryid, territory_name = name)
sales_person <- tbl(con, in_schema("sales", "salesperson")) %>%
select(businessentityid, territoryid)
employee <- tbl(con, in_schema("humanresources", "employee")) %>%
select(businessentityid, jobtitle)
```
In addition to selecting rows as shown in the previous statements, `mutate` and other functions help us replicate code in the `view` such as:
((p.firstname::text || ' '::text) ||
COALESCE(p.middlename::text || ' '::text,
''::text)) || p.lastname::text AS fullname
The following dplyr code pastes the first, middle and last names together to make `full_name`:
```{r}
person <- tbl(con, in_schema("person", "person")) %>%
mutate(full_name = paste(firstname, middlename, lastname)) %>%
select(businessentityid, full_name)
```
Double-check on the names that are defined in each `tbl` object. The following function will show the names of columns in the tables we've defined:
```{r}
getnames <- function(table) {
{table} %>%
collect(n = 5) %>% # ---- pull data here ---- #
names()
}
```
Verify the names selected:
```{r}
getnames(sales_order_header)
getnames(sales_territory)
getnames(sales_person)
getnames(employee)
getnames(person)
```
### Join the tables together
First, join and download all of the data pertaining to a person. Notice that since each of these 4 tables contain `businessentityid`, dplyr will join them all on that common column automatically. And since we know that all of these tables are small, we don't mind a query that joins and downloads all the data.
```{r}
salesperson_info <- sales_person %>%
left_join(employee) %>%
left_join(person) %>%
left_join(sales_territory) %>%
collect()
str(salesperson_info)
```
The one part of the view that we haven't replicated is:
`date_part('year'::text, soh.orderdate`
`+ '6 mons'::interval) AS fiscalyear`
The `lubridate` package makes it very easy to convert `orderdate` to `fiscal_year`. Doing that same conversion without lubridate (e.g., only dplyr and **ANSI-STANDARD** SQL) is harder. Therefore we just pull the data from the server after the `left_join` and do the rest of the job on the R side. Note that this query doesn't correct the problematic entry dates that we explored in the chapter on [Asking Business Questions From a Single Table](#chapter_exploring-a-single-table). That will collapse many rows into a much smaller table. We know from our previous investigation that Sales Rep into sales are recorded more or less once a month. Therefore most of the crunching in this query happens on the database server side.
```{r}
sales_data_fiscal_year <- sales_person %>%
left_join(sales_order_header, by = c("businessentityid" = "salespersonid")) %>%
group_by(businessentityid, orderdate) %>%
summarize(sales_total = sum(subtotal, na.rm = TRUE)) %>%
mutate(
orderdate = as.Date(orderdate),
day = day(orderdate)
) %>%
collect() %>% # ---- pull data here ---- #
mutate(
fiscal_year = year(orderdate %m+% months(6))
) %>%
ungroup() %>%
group_by(businessentityid, fiscal_year) %>%
summarize(sales_total = sum(sales_total, na.rm = FALSE)) %>%
ungroup()
```
Put the two parts together: `sales_data_fiscal_year` and `person_info` to yield the final query.
```{r}
salesperson_sales_by_fiscal_years_dplyr <- sales_data_fiscal_year %>%
left_join(salesperson_info) %>%
filter(!is.na(territoryid))
```
Notice that we're dropping the Sales Managers who appear in the `salesperson_info` data frame because they don't have a `territoryid`.
## Compare the official view and the dplyr output
Use `pivot_wider` to make it easier to compare the native `view` to our dplyr replicate.
```{r}
salesperson_sales_by_fiscal_years_dplyr %>%
select(-jobtitle, -businessentityid, -territoryid) %>%
pivot_wider(names_from = fiscal_year, values_from = sales_total,
values_fill = list(sales_total = 0)) %>%
arrange(territory_name, full_name) %>%
filter(territory_name == "Canada")
v_salesperson_sales_by_fiscal_years_data %>%
select(-jobtitle, -salespersonid) %>%
pivot_wider(names_from = fiscalyear, values_from = salestotal,
values_fill = list(salestotal = 0)) %>%
arrange(salesterritory, fullname) %>%
filter(salesterritory == "Canada")
```
The yearly totals match exactly. The column names don't match up, because we are using snake case convention for derived elements.
## Revise the view to summarize by quarter not fiscal year
To summarize sales data by SAles Rep and quarter requires the `%m+%` infix operator from lubridate. The interleaved comments in the query below has hints that explain it. The totals in this revised query are off by a rounding error from the totals shown above in the fiscal year summaries.
```{r}
tbl(con, in_schema("sales", "salesorderheader")) %>%
group_by(salespersonid, orderdate) %>%
summarize(subtotal = sum(subtotal, na.rm = TRUE), digits = 0) %>%
collect() %>% # ---- pull data here ---- #
# Adding 6 months to orderdate requires a lubridate function
mutate(orderdate = as.Date(orderdate) %m+% months(6),
year = year(orderdate),
quarter = quarter(orderdate)) %>%
ungroup() %>%
group_by(salespersonid, year, quarter) %>%
summarize(subtotal = round(sum(subtotal, na.rm = TRUE), digits = 0)) %>%
ungroup() %>%
# Join with the person information previously gathered
left_join(salesperson_info, by = c("salespersonid" = "businessentityid")) %>%
filter(territory_name == "Canada") %>%
# Pivot to make it easier to see what's going on
pivot_wider(names_from = quarter, values_from = subtotal,
values_fill = list(Q1 = 0, Q2 = 0, Q3 = 0, Q4 = 0), names_prefix = "Q", id_cols = full_name:year) %>%
select(`Name` = full_name, year, Q1, Q2, Q3, Q4) %>%
mutate(`Year Total` = Q1 + Q2 + Q3 + Q4) %>%
head(., n = 10) %>%
gt() %>%
fmt_number(use_seps = TRUE, decimals = 0, columns = vars(Q1,Q2, Q3, Q4, `Year Total`))
```
## Clean up and close down
```{r}
connection_close(con) # Use in an interactive setting
# dbDisconnect(con) # Use in non-interactive setting
```