Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

grepl support for use with dbplyr #149

Closed
ellmanj opened this issue Jul 23, 2021 · 10 comments
Closed

grepl support for use with dbplyr #149

ellmanj opened this issue Jul 23, 2021 · 10 comments
Assignees
Labels
bug Something isn't working enhancement New feature or request

Comments

@ellmanj
Copy link

ellmanj commented Jul 23, 2021

There seems to be no implementation of grepl in the sql_translator for use with dbplyr.

For example:

> dplyr::tbl(db_connection, "concept") %>% filter(grepl('search_str', 'concept_name')) %>% show_query()
<SQL>
SELECT *
FROM "concept"
WHERE (grepl('search_str', "concept_name"))

Instead I expected the output to be something like this:

<SQL>
SELECT *
FROM "concept"
WHERE concept_name like '%search_str%'

Other backends for dbplyr include support for grepl, e.g. here is the postgres implementation.

@DyfanJones
Copy link
Owner

Ah sorry about that, I will take a look at it this weekend :)

@DyfanJones DyfanJones self-assigned this Jul 23, 2021
@DyfanJones DyfanJones added the bug Something isn't working label Jul 23, 2021
@ellmanj
Copy link
Author

ellmanj commented Jul 23, 2021

Thank you for the quick response! Let me know if there is anything I can do to help.

@DyfanJones
Copy link
Owner

Athena can utilise these Pestro Regular Expression Functions https://prestodb.io/docs/current/functions/regexp.html. I wonder if it would beneficial to map them to R equivalents. 🤔

@DyfanJones
Copy link
Owner

Presto has a nice function regexp_like which seems to be fairly close to R's grepl and regexpr.

 regexp_like(string, pattern) → boolean#

    Evaluates the regular expression pattern and determines if it is contained within string.

    This function is similar to the LIKE operator, except that the pattern only needs to be contained within string, rather than needing to match all of string. In other words, this performs a contains operation rather than a match operation. You can match the entire string by anchoring the pattern using ^ and $:

I wonder if grepl should be mapped to like and regexr should be mapped to regexp_like. The pattern utilised is very much what you would used when using grepl in R.

@ellmanj
Copy link
Author

ellmanj commented Jul 23, 2021

Hmm, yeah seems like we could use regexp_like here.

In my current project I don't have a use-case for regexpr, and I don't see an implementation for it in other dbplyr backends such as Postgres. I wonder if that's because it's not as applicable in the context of SQL queries since it returns information about the location of the matching text and it's length. Whereas grepl just returned a boolean True/False on whether or not a match exists.

With regard to implementation for grepl, I wonder the best way to implement the ignore.case option. This post seems to show two potential ways, one using lower() and LIKE, and one using regexp_like(). I wonder if there's a difference in terms of performance...

@DyfanJones
Copy link
Owner

hi @ellmanj,

I have an initial implementation. I have also thrown in support for stringr and lubridate functions:

remotes::install_github("DyfanJones/noctua", ref = "sql_translate")
library(DBI)
library(dplyr)

con <- dbConnect(noctua::athena())

demo_iris <- iris
iris$date <- Sys.Date()

dbWriteTable(con, "demo_iris", iris)

demo_iris <- tbl(con, "demo_iris")

# return nothing
demo_iris %>%
  filter(grepl("Set", species))

demo_iris %>%
  filter(grepl("Set", species, ignore.case = T))
# Source:   lazy query [?? x 6]
# Database: Athena 0.1.11 [default@eu-west-1/default]
# sepal_length sepal_width petal_length petal_width species date      
# <dbl>       <dbl>        <dbl>       <dbl> <chr>   <date>    
# 1          5.1         3.5          1.4         0.2 setosa  2021-07-23
# 2          4.9         3            1.4         0.2 setosa  2021-07-23
# 3          4.7         3.2          1.3         0.2 setosa  2021-07-23
# 4          4.6         3.1          1.5         0.2 setosa  2021-07-23

# stringr example
demo_iris %>%
  mutate(str_loc = str_locate(species, "o"))
# Source:   lazy query [?? x 7]
# # Database: Athena 0.1.11 [default@eu-west-1/default]
# sepal_length sepal_width petal_length petal_width species date       str_loc
# <dbl>       <dbl>        <dbl>       <dbl> <chr>   <date>     <int64>
# 1          5.1         3.5          1.4         0.2 setosa  2021-07-23       4
# 2          4.9         3            1.4         0.2 setosa  2021-07-23       4
# 3          4.7         3.2          1.3         0.2 setosa  2021-07-23       4
# 4          4.6         3.1          1.5         0.2 setosa  2021-07-23       4

# lubridate example
demo_iris %>%
  mutate(day_of_week = wday(date, label = T))
# Source:   lazy query [?? x 7]
# Database: Athena 0.1.11 [default@eu-west-1/default]
# sepal_length sepal_width petal_length petal_width species date       day_of_week
# <dbl>       <dbl>        <dbl>       <dbl> <chr>   <date>     <chr>      
# 1          5.1         3.5          1.4         0.2 setosa  2021-07-23 Fri        
# 2          4.9         3            1.4         0.2 setosa  2021-07-23 Fri        
# 3          4.7         3.2          1.3         0.2 setosa  2021-07-23 Fri

dbRemoveTable(con, "demo_iris")

Please try it out and let me know.

@DyfanJones
Copy link
Owner

Hmm, yeah seems like we could use regexp_like here.

In my current project I don't have a use-case for regexpr, and I don't see an implementation for it in other dbplyr backends such as Postgres. I wonder if that's because it's not as applicable in the context of SQL queries since it returns information about the location of the matching text and it's length. Whereas grepl just returned a boolean True/False on whether or not a match exists.

With regard to implementation for grepl, I wonder the best way to implement the ignore.case option. This post seems to show two potential ways, one using lower() and LIKE, and one using regexp_like(). I wonder if there's a difference in terms of performance...

I'm not a 100% sure around performance. Might need to do a little digging. However as a side note you can use sql like in dplyr using the following:

demo_iris %>%
  filter(species %like% "%set%")

hope that gives another method you can use (%like% is already supported if you wish to use it now 😄 )

@ellmanj
Copy link
Author

ellmanj commented Jul 24, 2021

Please try it out and let me know.

Wow, thanks for getting on this so fast. I tried it out and it looks good to me!

@DyfanJones DyfanJones added the enhancement New feature or request label Jul 26, 2021
@DyfanJones
Copy link
Owner

@ellmanj PR #150 has now been merged to master. I will update RAthena and then push the updates to the cran. In the mean time if you need to the fix please install github version:

remotes::install_github("DyfanJones/noctua")

@DyfanJones
Copy link
Owner

@ellmanj noctua and RAthena has been release to the cran. It should be accessible from the cran in the next couple of days

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants