Commit

various small translate-sql vignette corrections (#1550)
bbolker authored Oct 31, 2024
1 parent d51bc0d commit 1abf1d6
Showing 1 changed file with 16 additions and 16 deletions.
32 changes: 16 additions & 16 deletions vignettes/translation-function.Rmd
@@ -30,15 +30,15 @@ con <- simulate_dbi()
translate_sql((x + y) / 2, con = con)
```

-`translate_sql()` takes an optional `con` parameter. If not supplied, this causes dplyr to generate (approximately) SQL-92 compliant SQL. If supplied, dplyr uses `sql_translation()` to look up a custom environment which makes it possible for different databases to generate slightly different SQL: see `vignette("new-backend")` for more details. You can use the various simulate helpers to see the translations used by different backends:
+`translate_sql()` takes an optional `con` parameter. If not supplied, this causes dbplyr to generate (approximately) SQL-92 compliant SQL. If supplied, dbplyr uses `sql_translation()` to look up a custom environment which makes it possible for different databases to generate slightly different SQL: see `vignette("new-backend")` for more details. You can use the various simulate helpers to see the translations used by different backends:

```{r}
translate_sql(x ^ 2L, con = con)
translate_sql(x ^ 2L, con = simulate_sqlite())
translate_sql(x ^ 2L, con = simulate_access())
```

-Perfect translation is not possible because databases don't have all the functions that R does. The goal of dplyr is to provide a semantic rather than a literal translation: what you mean, rather than precisely what is done. In fact, even for functions that exist both in databases and R, you shouldn't expect results to be identical; database programmers have different priorities than R core programmers. For example, in R in order to get a higher level of numerical accuracy, `mean()` loops through the data twice. R's `mean()` also provides a `trim` option for computing trimmed means; this is something that databases do not provide.
+Perfect translation is not possible because databases don't have all the functions that R does. The goal of dbplyr is to provide a semantic rather than a literal translation: what you mean, rather than precisely what is done. In fact, even for functions that exist both in databases and in R, you shouldn't expect results to be identical; database programmers have different priorities than R core programmers. For example, in R in order to get a higher level of numerical accuracy, `mean()` loops through the data twice. R's `mean()` also provides a `trim` option for computing trimmed means; this is something that databases do not provide.

If you're interested in how `translate_sql()` is implemented, the basic techniques that underlie the implementation of `translate_sql()` are described in ["Advanced R"](https://adv-r.hadley.nz/translation.html).

@@ -63,7 +63,7 @@ The following examples work through some of the basic differences between R and
```
* R and SQL have different defaults for integers and reals.
-In R, 1 is a real, and 1L is an integer. In SQL, 1 is an integer, and 1.0 is a real
+In R, 1 is a real, and 1L is an integer. In SQL, 1 is an integer, and 1.0 is a real.
```{r}
translate_sql(1, con = con)
@@ -104,7 +104,7 @@ dbplyr no longer translates `%/%` because there's no robust cross-database trans

### Aggregation

-All database provide translation for the basic aggregations: `mean()`, `sum()`, `min()`, `max()`, `sd()`, `var()`. Databases automatically drop NULLs (their equivalent of missing values) whereas in R you have to ask nicely. The aggregation functions warn you about this important difference:
+All databases provide translation for the basic aggregations: `mean()`, `sum()`, `min()`, `max()`, `sd()`, `var()`. Databases automatically drop NULLs (their equivalent of missing values) whereas in R you have to ask nicely. The aggregation functions warn you about this important difference:

```{r}
translate_sql(mean(x), con = con)
@@ -119,7 +119,7 @@ translate_sql(mean(x, na.rm = TRUE), window = FALSE, con = con)

### Conditional evaluation

-`if` and `switch()` are translate to `CASE WHEN`:
+`if` and `switch()` are translated to `CASE WHEN`:

```{r}
translate_sql(if (x > 5) "big" else "small", con = con)
@@ -135,7 +135,7 @@ translate_sql(switch(x, a = 1L, b = 2L, 3L), con = con)

## Unknown functions

-Any function that dplyr doesn't know how to convert is left as is. This means that database functions that are not covered by dplyr can often be used directly via `translate_sql()`.
+Any function that dbplyr doesn't know how to convert is left as is. This means that database functions that are not covered by dbplyr can often be used directly via `translate_sql()`.

### Prefix functions

@@ -145,15 +145,15 @@ Any function that dbplyr doesn't know about will be left as is:
translate_sql(foofify(x, y), con = con)
```

-Because SQL functions are general case insensitive, I recommend using upper case when you're using SQL functions in R code. That makes it easier to spot that you're doing something unusual:
+Because SQL functions are generally case insensitive, I recommend using upper case when you're using SQL functions in R code. That makes it easier to spot that you're doing something unusual:

```{r}
translate_sql(FOOFIFY(x, y), con = con)
```

### Infix functions

-As well as prefix functions (where the name of the function comes before the arguments), dbplyr also translates infix functions. That allows you to use expressions like `LIKE` which does a limited form of pattern matching:
+As well as prefix functions (where the name of the function comes before the arguments), dbplyr also translates infix functions. That allows you to use expressions like `LIKE`, which does a limited form of pattern matching:

```{r}
translate_sql(x %LIKE% "%foo%", con = con)
@@ -190,7 +190,7 @@ mf %>%

### Error for unknown translations

-If needed, you can also force dbplyr to error if it doesn't know how to translate a function with the `dplyr.strict_sql` option:
+If needed, you can also use the `dplyr.strict_sql` option to force dbplyr to error if it doesn't know how to translate a function:

```{r}
#| error = TRUE
@@ -245,16 +245,16 @@ Things get a little trickier with window functions, because SQL's window functio
knitr::include_graphics("windows.png", dpi = 300)
```
-Of the many possible specifications, there are only three that commonly
+Of the many possible specifications, only three are commonly
used. They select between aggregation variants:
-* Recycled: `BETWEEN UNBOUND PRECEEDING AND UNBOUND FOLLOWING`
+* Recycled: `BETWEEN UNBOUND PRECEDING AND UNBOUND FOLLOWING`
-* Cumulative: `BETWEEN UNBOUND PRECEEDING AND CURRENT ROW`
+* Cumulative: `BETWEEN UNBOUND PRECEDING AND CURRENT ROW`
-* Rolling: `BETWEEN 2 PRECEEDING AND 2 FOLLOWING`
+* Rolling: `BETWEEN 2 PRECEDING AND 2 FOLLOWING`
-dplyr generates the frame clause based on whether your using a recycled
+dbplyr generates the frame clause based on whether you're using a recycled
aggregate or a cumulative aggregate.
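
A minimal base R sketch of what the three frame variants compute, assuming a single partition that is already in the desired order. The vector `x` and the result names are made up for illustration, and no database is involved. Note that standard SQL spells the keyword `UNBOUNDED`.

```{r}
# Illustrative data: one partition, already ordered
x <- c(1, 2, 3, 4, 5)

# Recycled (UNBOUNDED PRECEDING to UNBOUNDED FOLLOWING):
# every row sees the whole partition
recycled <- rep(sum(x), length(x))

# Cumulative (UNBOUNDED PRECEDING to CURRENT ROW):
# every row sees the rows up to and including itself
cumulative <- cumsum(x)

# Rolling (2 PRECEDING to 2 FOLLOWING):
# every row sees itself plus up to two rows on either side;
# the window is truncated at the edges of the partition
rolling <- vapply(
  seq_along(x),
  function(i) sum(x[max(1, i - 2):min(length(x), i + 2)]),
  numeric(1)
)
```

The rolling version clips the window at the first and last rows rather than padding with missing values, which is how a SQL frame behaves at partition boundaries.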
To see how individual window functions are translated to SQL, we can again use `translate_sql()`:
@@ -266,14 +266,14 @@ translate_sql(ntile(G, 2), con = con)
translate_sql(lag(G), con = con)
```

-If the tbl has been grouped or arranged previously in the pipeline, then dplyr will use that information to set the "partition by" and "order by" clauses. For interactive exploration, you can achieve the same effect by setting the `vars_group` and `vars_order` arguments to `translate_sql()`
+If the tbl has been grouped or arranged previously in the pipeline, then dplyr will use that information to set the "partition by" and "order by" clauses. For interactive exploration, you can achieve the same effect by setting the `vars_group` and `vars_order` arguments to `translate_sql()`:

```{r}
translate_sql(cummean(G), vars_order = "year", con = con)
translate_sql(rank(), vars_group = "ID", con = con)
```
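
A hedged sketch of the pipeline form, assuming the RSQLite package is installed so that `memdb_frame()` can create an in-memory table. The table `mf_demo`, its values, and the column name `G_cummean` are invented for illustration; the exact SQL text may differ between dbplyr versions.

```{r}
library(dplyr, warn.conflicts = FALSE)
library(dbplyr)

# Hypothetical in-memory table (needs RSQLite); column names echo the examples above
mf_demo <- memdb_frame(
  ID = c("a", "a", "b", "b"),
  year = c(2001, 2002, 2001, 2002),
  G = c(1, 2, 3, 4)
)

# group_by() is expected to supply PARTITION BY,
# and window_order() is expected to supply ORDER BY
q <- mf_demo %>%
  group_by(ID) %>%
  window_order(year) %>%
  mutate(G_cummean = cummean(G))

show_query(q)
```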

-There are some challenges when translating window functions between R and SQL, because dplyr tries to keep the window functions as similar as possible to both the existing R analogues and to the SQL functions. This means that there are three ways to control the order clause depending on which window function you're using:
+There are some challenges when translating window functions between R and SQL, because dbplyr tries to keep the window functions as similar as possible to both the existing R analogues and to the SQL functions. This means that there are three ways to control the order clause depending on which window function you're using:

* For ranking functions, the ordering variable is the first argument: `rank(x)`,
`ntile(y, 2)`. If omitted or `NULL`, will use the default ordering associated