grepl support for use with dbplyr #149

ellmanj · 2021-07-23T16:52:36Z

There seems to be no implementation of grepl in the sql_translator for use with dbplyr.

For example:

> dplyr::tbl(db_connection, "concept") %>% filter(grepl('search_str', 'concept_name')) %>% show_query()
<SQL>
SELECT *
FROM "concept"
WHERE (grepl('search_str', "concept_name"))

Instead I expected the output to be something like this:

<SQL>
SELECT *
FROM "concept"
WHERE concept_name like '%search_str%'

Other backends for dbplyr include support for grepl, e.g. here is the postgres implementation.

The text was updated successfully, but these errors were encountered:

DyfanJones · 2021-07-23T19:50:33Z

Ah sorry about that, I will take a look at it this weekend :)

ellmanj · 2021-07-23T20:11:07Z

Thank you for the quick response! Let me know if there is anything I can do to help.

DyfanJones · 2021-07-23T20:15:51Z

Athena can utilise these Pestro Regular Expression Functions https://prestodb.io/docs/current/functions/regexp.html. I wonder if it would beneficial to map them to R equivalents. 🤔

DyfanJones · 2021-07-23T20:19:04Z

Presto has a nice function regexp_like which seems to be fairly close to R's grepl and regexpr.

 regexp_like(string, pattern) → boolean#

    Evaluates the regular expression pattern and determines if it is contained within string.

    This function is similar to the LIKE operator, except that the pattern only needs to be contained within string, rather than needing to match all of string. In other words, this performs a contains operation rather than a match operation. You can match the entire string by anchoring the pattern using ^ and $:

I wonder if grepl should be mapped to like and regexr should be mapped to regexp_like. The pattern utilised is very much what you would used when using grepl in R.

ellmanj · 2021-07-23T21:23:11Z

Hmm, yeah seems like we could use regexp_like here.

In my current project I don't have a use-case for regexpr, and I don't see an implementation for it in other dbplyr backends such as Postgres. I wonder if that's because it's not as applicable in the context of SQL queries since it returns information about the location of the matching text and it's length. Whereas grepl just returned a boolean True/False on whether or not a match exists.

With regard to implementation for grepl, I wonder the best way to implement the ignore.case option. This post seems to show two potential ways, one using lower() and LIKE, and one using regexp_like(). I wonder if there's a difference in terms of performance...

DyfanJones · 2021-07-23T22:34:52Z

hi @ellmanj,

I have an initial implementation. I have also thrown in support for stringr and lubridate functions:

remotes::install_github("DyfanJones/noctua", ref = "sql_translate")

library(DBI)
library(dplyr)

con <- dbConnect(noctua::athena())

demo_iris <- iris
iris$date <- Sys.Date()

dbWriteTable(con, "demo_iris", iris)

demo_iris <- tbl(con, "demo_iris")

# return nothing
demo_iris %>%
  filter(grepl("Set", species))

demo_iris %>%
  filter(grepl("Set", species, ignore.case = T))
# Source:   lazy query [?? x 6]
# Database: Athena 0.1.11 [default@eu-west-1/default]
# sepal_length sepal_width petal_length petal_width species date      
# <dbl>       <dbl>        <dbl>       <dbl> <chr>   <date>    
# 1          5.1         3.5          1.4         0.2 setosa  2021-07-23
# 2          4.9         3            1.4         0.2 setosa  2021-07-23
# 3          4.7         3.2          1.3         0.2 setosa  2021-07-23
# 4          4.6         3.1          1.5         0.2 setosa  2021-07-23

# stringr example
demo_iris %>%
  mutate(str_loc = str_locate(species, "o"))
# Source:   lazy query [?? x 7]
# # Database: Athena 0.1.11 [default@eu-west-1/default]
# sepal_length sepal_width petal_length petal_width species date       str_loc
# <dbl>       <dbl>        <dbl>       <dbl> <chr>   <date>     <int64>
# 1          5.1         3.5          1.4         0.2 setosa  2021-07-23       4
# 2          4.9         3            1.4         0.2 setosa  2021-07-23       4
# 3          4.7         3.2          1.3         0.2 setosa  2021-07-23       4
# 4          4.6         3.1          1.5         0.2 setosa  2021-07-23       4

# lubridate example
demo_iris %>%
  mutate(day_of_week = wday(date, label = T))
# Source:   lazy query [?? x 7]
# Database: Athena 0.1.11 [default@eu-west-1/default]
# sepal_length sepal_width petal_length petal_width species date       day_of_week
# <dbl>       <dbl>        <dbl>       <dbl> <chr>   <date>     <chr>      
# 1          5.1         3.5          1.4         0.2 setosa  2021-07-23 Fri        
# 2          4.9         3            1.4         0.2 setosa  2021-07-23 Fri        
# 3          4.7         3.2          1.3         0.2 setosa  2021-07-23 Fri

dbRemoveTable(con, "demo_iris")

Please try it out and let me know.

DyfanJones · 2021-07-23T22:37:22Z

Hmm, yeah seems like we could use regexp_like here.

In my current project I don't have a use-case for regexpr, and I don't see an implementation for it in other dbplyr backends such as Postgres. I wonder if that's because it's not as applicable in the context of SQL queries since it returns information about the location of the matching text and it's length. Whereas grepl just returned a boolean True/False on whether or not a match exists.

With regard to implementation for grepl, I wonder the best way to implement the ignore.case option. This post seems to show two potential ways, one using lower() and LIKE, and one using regexp_like(). I wonder if there's a difference in terms of performance...

I'm not a 100% sure around performance. Might need to do a little digging. However as a side note you can use sql like in dplyr using the following:

demo_iris %>%
  filter(species %like% "%set%")

hope that gives another method you can use (%like% is already supported if you wish to use it now 😄 )

ellmanj · 2021-07-24T02:43:20Z

Please try it out and let me know.

Wow, thanks for getting on this so fast. I tried it out and it looks good to me!

DyfanJones · 2021-07-26T21:37:32Z

@ellmanj PR #150 has now been merged to master. I will update RAthena and then push the updates to the cran. In the mean time if you need to the fix please install github version:

remotes::install_github("DyfanJones/noctua")

DyfanJones · 2021-07-27T13:39:00Z

@ellmanj noctua and RAthena has been release to the cran. It should be accessible from the cran in the next couple of days

DyfanJones self-assigned this Jul 23, 2021

DyfanJones added the bug Something isn't working label Jul 23, 2021

DyfanJones added the enhancement New feature or request label Jul 26, 2021

DyfanJones added a commit that referenced this issue Jul 26, 2021

increase sql_translate_env support of R functions (#149)

9df4ef4

DyfanJones closed this as completed Jul 26, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

grepl support for use with dbplyr #149

grepl support for use with dbplyr #149

ellmanj commented Jul 23, 2021

DyfanJones commented Jul 23, 2021

ellmanj commented Jul 23, 2021

DyfanJones commented Jul 23, 2021

DyfanJones commented Jul 23, 2021

ellmanj commented Jul 23, 2021

DyfanJones commented Jul 23, 2021

DyfanJones commented Jul 23, 2021

ellmanj commented Jul 24, 2021

DyfanJones commented Jul 26, 2021

DyfanJones commented Jul 27, 2021

grepl support for use with dbplyr #149

grepl support for use with dbplyr #149

Comments

ellmanj commented Jul 23, 2021

DyfanJones commented Jul 23, 2021

ellmanj commented Jul 23, 2021

DyfanJones commented Jul 23, 2021

DyfanJones commented Jul 23, 2021

ellmanj commented Jul 23, 2021

DyfanJones commented Jul 23, 2021

DyfanJones commented Jul 23, 2021

ellmanj commented Jul 24, 2021

DyfanJones commented Jul 26, 2021

DyfanJones commented Jul 27, 2021