title | author | date | output | vignette |
---|---|---|---|---|
vertica.dplyr User Guide |
Edward Ma |
2015-07-30 |
rmarkdown::html_vignette |
%\VignetteIndexEntry{vertica.dplyr User Guide} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc}
|
vertica.dplyr is an R package developed by Hewlett-Packard that provides HP Vertica backend support for the dplyr package. dplyr provides tools for data wrangling using R data frames. With vertica.dplyr, users can now do the same in Vertica tables without moving the data into R. vertica.dplyr also provides the following additional features to simplify working with HP Vertica:
- Easy CSV loading into HP Vertica.
- copy_to functionality that takes advantage of HP Vertica's fast COPY LOCAL feature.
- Connectivity to HP Vertica through either JDBC or ODBC.
A brief background of the dplyr package is provided in the following section.
To learn more, read the dplyr vignette.
dplyr offers convenient methods of manipulating data frames, which can be correlated to tables in a database. Common actions in dplyr include what are known as the "verbs" of the package.
For single data.frames (or tables):
- filter
- Filters tables by rows matching certain criteria (e.g., filter(table,col1 > 5, col2 ==3)), equivalent to (SELECT ... from table WHERE ...)
- arrange
- Orders rows (for example, arrange(tbl,desc(col1)))
- select
- Selects columns (for example, select(tbl,'col1','col2'))
- mutate
- Create new columns (for example, mutate(tbl,new_col1=col3/col2,new_col2=fun(col4,col5)))
- summarise (or summarize)
- Collapses rows belonging to a certain group into a single one (for example, for aggregations)
- Analogous to GROUP BY
There are also verbs for multiple tables, such as left_join(), inner_join(), semi_join(), anti_join(), union(), intersect().
For data that resides in a database, dplyr provides an interface to access tables and manipulate them directly inside the database as though they were local data frames. The dplyr package provides support for MySQL, Postgres, BigQuery, and SQLite backends. vertica.dplyr enables dplyr to work with HP Vertica.
vertica.dplyr adds dplyr-support for HP Vertica, as well as serving as the basic "intermediary" between R and HP Vertica, which includes data preparation, data transfer, and HP Vertica function invocation.
You can obtain vertica.dplyr from Github.
vertica.dplyr has 2 hard dependencies: assertthat and dplyr. There are additional soft dependencies on vRODBC, RJDBC, and HPdata.
Soft means that there is no enforcement of having these packages installed on the user's system to install vertica.dplyr. However, to use vertica.dplyr, at least one of RJDBC or vRODBC must be installed and configured. These soft dependencies are loaded (at which point it is required to have the package installed) when their specific functionality is invoked. For example, when creating a vertica.dplyr connection to HP Vertica with ODBC, vertica.dplyr checks to make sure that vRODBC is installed and loadable. If it isn't, an error is thrown.
You also need access to an HP Vertica database (must be configured separately), as well as either (or both) the ODBC or JDBC HP Vertica driver(s) installed and configured on your host machine (the one that runs vertica.dplyr). Please refer to external documentation for help with these steps. A guide is available at this link.
The following example demonstrates some of the fundamental data manipulation features of vertica.dplyr (and dplyr).
First, load the package in R:
# Load the vertica.dplyr package, which loads dplyr as a dependency
library(vertica.dplyr)
Note that a few of the imported dplyr functions have names that overwrite their R-base equivalents.
First, connect to HP Vertica using the src_vertica
function. There are two ways to do this:
# Connect to HP Vertica using vRODBC
vertica_odbc <- src_vertica("VerticaDSN")
# Connect to HP Vertica using RJDBC
vertica_jdbc <- src_vertica(dsn = NULL, jdbcpath="/opt/vertica/java/lib/vertica_jdbc.jar","foobar","localhost",5433,"dbadmin")
In this example, you connect to an HP Vertica DB instance named "foobar" that is listening on port 5433, with db-owner username "dbadmin", and no password. The ODBC version is much more brief, because all of the configuration details are settled inside of ODBC.ini on the system, and you only need to provide the DSN. Note that if dsn
is NULL, it is assumed that JDBC is to be used.
Both vertica_jdbc
and vertica_odbc
are src_vertica
objects, and vertica.dplyr functionality works on both object types interchangeably. The only differences are how the package prints some metadata when you look at these objects,
# show differences
vertica_odbc
## src: Vertica ODBC Connection
## -----+DSN: VerticaDSN
## -----+Host: 127.0.0.1
## -----+DB Version: 07.01.0002
## -----+ODBC Version: 03.80
## tbls: AllstarFull, asdf, AwardsManagers, AwardsPlayers,
## AwardsShareManagers, AwardsSharePlayers, bank_original,
## compute_test_table, d1, d2, Fielding, FieldingOF, FieldingPost,
## HallOfFame, LahmanData, Managers, ManagersHalf, Master, Pitching,
## PitchingPost, ref_join, Salaries, Schools, SchoolsPlayers, SeriesPost,
## TeamsFranchises, tempName, test, tsempName
vertica_jdbc
## src: Vertica JDBC Connection [dbadmin@localhost:5433/foobar]
## tbls: AllstarFull, asdf, AwardsManagers, AwardsPlayers,
## AwardsShareManagers, AwardsSharePlayers, bank_original,
## compute_test_table, d1, d2, Fielding, FieldingOF, FieldingPost,
## HallOfFame, LahmanData, Managers, ManagersHalf, Master, Pitching,
## PitchingPost, ref_join, Salaries, Schools, SchoolsPlayers, SeriesPost,
## TeamsFranchises, tempName, test, tsempName
as well as whether or not dsn
is required as a parameter for tbl2dframe
, tbl2darray
, and tbl2darrays
.
As you can see, both connection options print information about the connection and list the current user tables (and views) found in the database. For the rest of this example, the ODBC version is used and it is called vertica
:
vertica <- vertica_odbc
class(vertica)
## [1] "src_vertica" "src_sql" "src"
The following are some basic principles of a "tbl":
-
The basic dplyr object is the "tbl" (or in this case, tbl_vertica). This is a reference to SQL, some not-yet executed query that results in a new table.
-
Conceptually, a "tbl_vertica" object is nothing more than a recursive list used to represent a sequence of operations on some table belonging to HP Vertica
-
Most operations (excluding those that convert tbls to data.frames or saves them to tables/view in HP Vertica) performed on a tbl result in another tbl. Essentially, you can chain these operations together, meaning that the SQL statements that these operations generate build up in complexity.
-
tbl objects are for the most part lazily executed on Vertica; that is, when you do not require the results (for example, moving the results to R, or looking at the data by printing the results in R), nothing is executed in HP Vertica. Sometimes to generate or lazily store the objects, vertica.dplyr will need to query Vertica for some meta-data (for example asking column names or number of rows).
One way to get started with a tbl_vertica object is to directly access an existing table. For example, you can access the "Salaries" table listed above in the printout of our src_vertica objects. To do this, use the tbl
function on the src_vertica
object. This table is from the Lahman
package available on CRAN.
salaries <- tbl(vertica,"Salaries")
The salaries
object now contains information to generate a simple select statement on the "Salaries" table in HP Vertica. You can examine the SQL query associated with a tbl_vertica
by calling the sql_render
function on the object:
sql_render(salaries)
<SQL> SELECT *
FROM "Salaries"
As mentioned before, nothing is yet executed in DB. If, however, you would like to see what the data looks like, you can ask R to print salaries
.
salaries
Source: query [?? x 5]
Database: Vertica ODBC Connection
-----+DSN: VerticaDSN
-----+Host: 127.0.0.1
-----+DB Version: 08.00.0000
-----+ODBC Version: 03.80
yearID teamID lgID playerID salary
<dbl> <fctr> <fctr> <fctr> <dbl>
1 1985 ATL NL barkele01 870000
2 1985 ATL NL bedrost01 550000
3 1985 ATL NL benedbr01 545000
4 1985 ATL NL campri01 633333
5 1985 ATL NL ceronri01 625000
6 1985 ATL NL chambch01 800000
7 1985 ATL NL dedmoje01 150000
8 1985 ATL NL forstte01 483333
9 1985 ATL NL garbege01 772000
10 1985 ATL NL harpete01 250000
# ... with more rows
The query is actually executed in the database, and the first few rows can be seen in the result (dplyr automatically takes the HEAD and displays it).
The primary way to view data is by accessing data already in database. An alternative workflow is to start with data in R and copy it to HP Vertica using copy_to
. You can also start from a CSV file and load it into HP Vertica using db_load_from_file
.
Start with data from all NYC flights in the year 2013, which you can get from the R package nycflights13
. Importing the library imports an R data.frame, flights
.
library(nycflights13)
# Peek at data.frame `flights`
head(flights,10)
# A tibble: 10 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ... with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <time>
nrow(flights)
[1] 336776
Copy it over to HP Vertica using copy_to
and view the result:
flights_vertica <- copy_to(vertica,flights,"flights")
flights_vertica
Source: query [?? x 19]
Database: Vertica ODBC Connection
-----+DSN: VerticaDSN
-----+Host: 127.0.0.1
-----+DB Version: 08.00.0000
-----+ODBC Version: 03.80
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ... with more rows, and 11 more variables: arr_delay <dbl>, carrier <fctr>,
# flight <dbl>, tailnum <fctr>, origin <fctr>, dest <fctr>, air_time <dbl>,
# distance <dbl>, hour <dbl>, minute <dbl>, time_hour <fctr>
copy_to
has automatically extracted the data types in the data.frame, converted them to a schema in HP Vertica, created the table in HP Vertica, and copied it over to the DB. It returns another tbl_vertica object which, in this case, is named flights_vertica
.
As one of the most densely populated regions in the United States, the New York metropolitan area is serviced by three major airports: JFK, Newark (EWR), and LaGuardia (LGA). Delays in flights are unavoidable, but do any of these airports have an edge over the others? How is flight delay affected by the time of year?
To help answer these inquiries, you may conduct a simple statistical analysis of the flight records in the year 2013.
Examining the flights_vertica
table, you can see that it has labeled each flight by some values such as date, departure/arrival times, and distance.
One basic operation is the select function. When connected to a database, this issues a corresponding SQL SELECT
on the specified table, with the specified columns.
Since this example only analyzes flight delay, the airports, and how these correspond to the dates (year is included but all of the data is from 2013 so the year column is discarded as well), you can simply select only those columns for now:
# Select columns month, day, arr_delay, and origin
q1 <- select(flights_vertica, year, month, day, origin, arr_delay)
q1
Source: query [?? x 5]
Database: Vertica ODBC Connection
-----+DSN: VerticaDSN
-----+Host: 127.0.0.1
-----+DB Version: 08.00.0000
-----+ODBC Version: 03.80
year month day origin arr_delay
<dbl> <dbl> <dbl> <fctr> <dbl>
1 2013 1 1 EWR 11
2 2013 1 1 LGA 20
3 2013 1 1 JFK 33
4 2013 1 1 JFK -18
5 2013 1 1 LGA -25
6 2013 1 1 EWR 12
7 2013 1 1 EWR 19
8 2013 1 1 LGA -14
9 2013 1 1 JFK -8
10 2013 1 1 LGA 8
# ... with more rows
To answer the first question (determining which airport has the worst delays), you want to examine only the flights that occur between the months of February and November. During the holiday season (approximately December through January), delays may be influenced by the increase in traveling, and such data may bias the result.
Since the select
operation from before returned another dplyr tbl object, you can chain that object with the following filter
operation by passing in q1
as its first argument.
# Filter out flights in January and December
q2 <- filter(q1, year == 2013, month > 1, month < 12)
q2
Source: query [?? x 5]
Database: Vertica ODBC Connection
-----+DSN: VerticaDSN
-----+Host: 127.0.0.1
-----+DB Version: 08.00.0000
-----+ODBC Version: 03.80
year month day origin arr_delay
<dbl> <dbl> <dbl> <fctr> <dbl>
1 2013 2 1 EWR NA
2 2013 2 1 EWR NA
3 2013 2 1 EWR NA
4 2013 2 1 EWR NA
5 2013 2 1 EWR NA
6 2013 2 1 EWR NA
7 2013 2 1 EWR NA
8 2013 2 1 EWR NA
9 2013 2 1 EWR NA
10 2013 2 1 EWR NA
# ... with more rows
The data now printed out show flights beginning with the first of February. Just as when you retrieved a reference to an HP Vertica table, you can view the query associated with this filter operation:
# Look at what we did in terms of SQL
sql_render(q2)
<SQL> SELECT *
FROM (SELECT "year" AS "year", "month" AS "month", "day" AS "day", "origin" AS "origin", "arr_delay" AS "arr_delay"
FROM "flights") "jmuwgnfpix"
WHERE (("year" = 2013.0) AND ("month" > 1.0) AND ("month" < 12.0))
As may be expected, the query resulting from a filter involves a WHERE
clause.
Notice that arr_delay
has NA values for many rows, which are NULL in HP Vertica. Get rid of these rows with incomplete data using another filter
operation on q2
:
# Filter out NA rows
q3 <- filter(q2,!is.na(arr_delay))
q3
Source: query [?? x 5]
Database: Vertica ODBC Connection
-----+DSN: VerticaDSN
-----+Host: 127.0.0.1
-----+DB Version: 08.00.0000
-----+ODBC Version: 03.80
year month day origin arr_delay
<dbl> <dbl> <dbl> <fctr> <dbl>
1 2013 2 1 EWR 4
2 2013 2 1 EWR -4
3 2013 2 1 LGA 8
4 2013 2 1 JFK -10
5 2013 2 1 JFK 9
6 2013 2 1 EWR -14
7 2013 2 1 JFK -1
8 2013 2 1 LGA 9
9 2013 2 1 LGA 0
10 2013 2 1 LGA -4
# ... with more rows
There are no longer any NA values. dplyr has converted the R-style boolean expression on arr_delay
for NA
values into NOT NULL
in SQL.
Now that the data has been cleaned up, you can find which airport had the worst delays going out of New York City in 2013, between the months of February and November.
Essentially, these operations are equivalent to running AVG and GROUP BY operations on these columns. In this example, you use the dplyr equivalent.
First, use group_by
, which constructs the grouping clause, followed by a summarise (which takes a group_by clause as its first argument) to generate the report. Along with the report, you can also create new columns for determining the number of flights belonging to each airport (call this count
), as well as the average delay for that airport (called delay).
# Group by airport (origin)
by_origin <- group_by(q3, origin)
q4 <- summarise(by_origin,
count = n(),
delay = mean(arr_delay)
)
The summarise
function allows you to create two new columns, count
and delay
as functions of the GROUP BY origin
created by group_by
, which you store as by_origin
.
The arguments on the right-hand side, n
and mean
, correspond to the vertica.dplyr bindings of the HP Vertica analytic functions you are invoking (COUNT(*) and AVG(arr_delay)).
Before you look at the result, rank these airports by worst (most delay) to best using arrange
:
# Average delays by airport
q5 <- arrange(q4,desc(delay))
q5
Source: query [?? x 3]
Database: Vertica ODBC Connection
-----+DSN: VerticaDSN
-----+Host: 127.0.0.1
-----+DB Version: 08.00.0000
-----+ODBC Version: 03.80
origin count delay
<fctr> <dbl> <dbl>
1 EWR 98101 7.733132
2 LGA 84702 5.370121
3 JFK 91125 5.268258
From these results you can see that, compared to other airports servicing the NYC region, EWR (Newark) had, by roughly two-and-a-half minutes, the worst average arrival delay in 2013.
Now look again at the query by accessing $query
of a tbl
object. Use the q5
variable.
sql_render(q5)
<SQL> SELECT *
FROM (SELECT "origin", count(*) AS "count", AVG("arr_delay") AS "delay"
FROM (SELECT *
FROM (SELECT *
FROM (SELECT "year" AS "year", "month" AS "month", "day" AS "day", "origin" AS "origin", "arr_delay" AS "arr_delay"
FROM "flights") "sqdztmepka"
WHERE (("year" = 2013.0) AND ("month" > 1.0) AND ("month" < 12.0))) "gbpmhczqce"
WHERE (NOT(("arr_delay") IS NULL))) "tzdmvoxcvd"
GROUP BY "origin") "psjbudtstt"
ORDER BY "delay" DESC
Note that this query is a complex query that has been built off of the initial tbl created by the copy_to
statement since before the data preparation began. This means that, if all of the intermediate results had not been printed in console, nothing from q1 through q5 would have even touched the DB.
To bring this data back into the R console, use collect
. collect
takes the saved query, executes it on the database, and brings the results back into R as a data.frame.
# Bring results back into R as a data.frame, then plot the results
airport_delays <- collect(q5)
barplot(airport_delays[["delay"]],horiz=TRUE,
names.arg=airport_delays[["origin"]],
xlab="Average delay in minutes",ylab="Airport")
In the previous sections, AVG (called as mean()) in dplyr) as well as COUNT(*) (invoked with n() in dplyr) were used with the summarise
function.
This section describes in more detail how HP Vertica functions can be invoked, namely window and aggregate functions, on different partitions of data.
Note that a complete list of currently supported functions may be found at the end of this document.
In general, a function can be invoked by simply dropping it into any mutate
or summarise
statement. mutate
adds an entire column to your tbl, so use it when you expect to get a vector of results back from your function. For example, when invoking a window function. Meanwhile, summarise
collapses all rows (belonging to a group) into one (therefore, if no group_by
is used, your function runs on all rows of your table).
An aggregate function needs to be invoked with summarise
, which necessitates the additional use of group_by
to specify the groups on which to aggregate. If it is not used with any group_by
result, the function runs on the entire table, leaving you with only one result.
If you have a function foo
that you wish to run on attribute attr
in tbl
, grouping by column col
, with additional arguments args
, run:
grouped_by_col <- group_by(tbl,col)
result <- summarise(grouped_by_col,new_col = foo(attr,args))
Occasionally, the function may not take the attr
argument, or any arguments at all (such as COUNT(*)). In this case, no args need to be passed, for example, new_col = n()
for COUNT(*)
.
Notice that a new column is created named new_col
. This can be done using mutate
as well.
The PARTITION BY SQL statement is also constructed by group_by
. This means that, for window functions requiring such a clause (which is most), it is recommended to run a group_by
on the column you wish to partition by (although, this can be overridden using the partition
parameter).
In addition to this, many window functions also have some notion of a "window" of a specified width over which the function is applied. In HP Vertica, many of them (even those that do not require explicit ordering) require an obligatory ORDER BY clause when they are invoked.
For this reason, the syntax for running a window function on HP Vertica through dplyr is unique to vertica.dplyr.
In standard dplyr syntax, you need a group_by
call followed by a mutate
or filter
. To run a window function winfun
on column col
of table tbl
, partitioning by partcol
, invoke it in the following manner:
grouped_by_partcol <- group_by(tbl,partcol)
mutate(grouped_by_partcol, new_col = winfun(col,args))
You may notice that this does not enforce any particular ordering nor windowing range on the function. Because most HP Vertica functions require the ORDER BY statement, in vertica.dplyr there is a way to make this happen in your invocation of the window function, using the order
argument.
Similarly, you can specify your window size using range
. This is usually an optional argument (leaving this alone causes an attempt to run the function with no window size specified), but it is useful when it is needed.
Using both of these arguments, with the assumption that we want to have the range between min
and max
, ordering by ordercol
, the above flow is converted to the following in vertica.dplyr:
grouped_by_partcol <- group_by(tbl,partcol)
mutate(grouped_by_partcol, new_col = winfun(col,args,range=c(min,max),order=ordercol))
The following also has the same effect, since partition
also allows you to specify partitioning:
mutate(tbl, new_col = winfun(col,args,range=c(min,max),partition=partcol,order=ordercol))
Note that the range is specified as an R vector. -Inf
is converted to ROWS UNBOUNDED PRECEDING
, Inf
is converted to ROWS UNBOUNDED FOLLOWING
.
The following example demonstrates how delays changed over the calendar year. Return to the flights example, and grab query q1
. To determine our result, it is necessary to complete the following steps:
- Remove unnecessary columns and NULL values
- Average all delays for every unique calendar day
- To get a more "smoothed" result, run another average function over these delay values, with a running window width of 10 days (5 days prior to the current row, and 5 days after).
c <- filter(select(q1,month,day,arr_delay),!is.na(arr_delay))
calendar_day <- group_by(c,month,day)
delay_by_day <- summarise(calendar_day,avg_delay = mean(arr_delay))
delays <- mutate(delay_by_day, delay_smoothed = mean(avg_delay,range=c(-5,5),
partition=NULL,order=c(month,day)))
delays
Source: query [?? x 4]
Database: Vertica ODBC Connection
-----+DSN: VerticaDSN
-----+Host: 127.0.0.1
-----+DB Version: 08.00.0000
-----+ODBC Version: 03.80
Groups: month
month day avg_delay delay_smoothed
<dbl> <dbl> <dbl> <dbl>
1 1 1 12.6510229 5.3091754
2 1 2 12.6928879 3.8439629
3 1 3 5.7333333 2.9600202
4 1 4 -1.9328194 2.6017649
5 1 5 -1.5258020 1.7517068
6 1 6 4.2364294 1.1595273
7 1 7 -4.9473118 -1.1738523
8 1 8 -3.2275785 -0.9703107
9 1 9 -0.2642777 -1.1605861
10 1 10 -5.8988159 -0.9461796
# ... with more rows
Notice that it's necessary to set partition=NULL
to clear the implicit PARTITION BY MONTH,DAY
that exists from our earlier group_by
.
Now, plot delay_smoothed
.
delays_local <- collect(delays)
plot(delays_local[["delay_smoothed"]],xlab="Day of Year",
ylab="Average delay 5 days prior and after (minutes)")
lines(delays_local[["delay_smoothed"]])
grid(nx=5)
Delays in flights appear to peak during the summer months, with smaller peaks during the winter months (holiday season). Between days 200 and 300 (August to November), flights are very on time, and even have a tendency to be a little early.
For a list of functions that are currently tested to work in vertica.dplyr and their mappings to internal HP Vertica functions, please refer to the appropriately titled section at the end of this document.
Note: Pay particular attention to types. By default, numbers passed into R functions are of type numeric, and these do not work with HP Vertica functions that require integer values. To pass integer values, you must use "L" after the number in R, for example, func(2L,3L)
.
vertica.dplyr also allows user-defined functions (UDFs) in HP Vertica to be run. list_udf()
will allow you to take a src_vertica
connection object and will return to you a list of currently registered UDFs. This will return a data frame of UDFs, as well as their types. To invoke a UDF, simply follow the same conventions as the other functions mentioned above (using mutate or select). UDFs take in optional parameters, as specified in Vertica with a USING PARAMETERS
clause following the normal function arguments. To supply these parameters, supply a list()
in R to the params
argument in the function, corresponding to a key-value mapping for the parameters, for example, as below:
table <- tbl(vertica,"some_table")
result <- mutate(table,fun=some_udf(col1,col2,params=list(param1='x',foo=3,bar=3.4)))
The above example will result in a SELECT
on the columns of table
and SOME_UDF(col1,col2 USING PARAMETERS param1='x', foo=3, bar=3.4) AS fun
.
####Transform functions vertica.dplyr also allows users to call user-defined transform functions which are a special subclass of UDFs. The syntax to call these functions is slightly different from the general UDF syntax. They must be called with a select command.
table <- tbl(vertica,"some_table")
result <- select(table,fun=some_transform_udf(col1,col2,params=list(param1='x',foo=3,bar=3.4)))
Transform UDFs by default partition by the entire table however it is possible to call the transform udf on partitioned data:
table <- tbl(vertica,"some_table")
table <- group_by(table, col3)
result <- select(table,fun=some_transform_udf(col1,col2,params=list(param1='x',foo=3,bar=3.4)))
It is also possible to send an R function to vertica and call the function as a user defined function. Currently are several restrictions: *Any additional parameters must be less than 32 mb *The user defined functions must be installed *The function must take a data.frame input and produce a data.frame output *Any R libraries that are required by the function must be installed on vertica
The syntax is the following:
execute_custom_R_udf(FUN, my_tbl,
named_additional_arg1 = 1,
named_additional_arg2 = 2)
Here is an example that illustrates this:
copy_to(vertica, iris, "iris_table")
iris_table <- tbl(vertica, "iris")
model <- randomForest(Species ~ ., iris,ntree = 1)
result <- execute_custom_R_udf(function(x) {
library(randomForest);
predict(model,x)
}, iris_table, model = model)
result
The above code ouputs the following:
Source: query [?? x 5]
Database: Vertica ODBC Connection
-----+DSN: VerticaDSN
-----+Host: 127.0.0.1
-----+DB Version: 08.00.0000
-----+ODBC Version: 03.80
col1
<fctr>
1 setosa
2 setosa
3 setosa
4 setosa
5 setosa
6 setosa
7 setosa
8 setosa
9 setosa
10 setosa
# ... with more rows
The previous examples demonstrate how to access tables and manipulate them with some of the most common vertica.dplyr functions. The examples include a general template for how to invoke HP Vertica functions.
The following examples go over some of the additional things that can be done using vertica.dplyr.
Many of the functions described below take a VerticaConnection
object, meaning something that may be accessed by retrieving the con
member of your src_vertica
object. In the previous examples, this would be vertica$con
.
For several HP Vertica functions or external procedures, the invocation is one without a FROM
clause. To invoke them with vertica.dplyr, supply a select
statement with the first argument a src_vertica
object, as opposed to a tbl_vertica
; for example:
select(vertica,user=user())
Source: query [?? x 1]
Database: Vertica ODBC Connection
-----+DSN: VerticaDSN
-----+Host: 127.0.0.1
-----+DB Version: 08.00.0000
-----+ODBC Version: 03.80
user
<fctr>
1 dbadmin
vertica.dplyr allows for external procedures to be called with the following syntax:
results <- select(vertica, some_external_procedure('arg1', 'arg2'))
In the above command, we pass in the vertica source object and call an external procedure with the name "some_external_procedure" and pass it string arguments 'arg1' and 'arg2'. Below is a more concrete example:
copy_to(vertica, iris, "my_iris")
my_model <- select(vertica, kmeans('my_iris_model', 'my_iris', 'Sepal.Length, Sepal.Width, Petal.Width, Petal.Length', 3L))
my_model
The above has the following output:
Source: query [?? x 0]
Database: Vertica ODBC Connection
-----+DSN: VerticaDSN
-----+Host: 127.0.0.1
-----+DB Version: 08.00.0000
-----+ODBC Version: 03.80
KMEANS
1 Finished in 6 iterations\n
More often than not, the data to be prepared already exists in HP Vertica. On some other occasions, you may want to copy data from your R instance to HP Vertica using copy_to
. Other times, you might want to create a table of your own and take control over your schema and data types.
To do so, you need to create a character vector that describes the schema of your table, and invoke db_create_table
with a VerticaConnection
.
For example, to create a table named New_Table
, with a schema of INTEGER
, VARCHAR
, and VARCHAR
corresponding to columns employeeID
, name
, and department
, run the following commands:
types <- c("integer","varchar","varchar")
names(types) <- c("employeeID","name","department")
db_create_table(vertica$con,"New_Table",types)
db_load_from_file
provides a convenient way to load data from a file. To invoke it, it's necessary to know:
- The name of the file (example: foo.csv)
- Your
src_vertica
object (example:vertica
) - The type of delimiter in the file (example: ',')
- The number of lines to skip at the beginning of the file, which is useful for files with headings (example: 1L)
- Whether to append this data to existing data in a table or overwrite it. (example: TRUE)
- The name of the table into which you would like to load the data (example: mytable)
db_load_from_file(vertica,"mytable","foo.csv",sep=",",skip=1L,append=TRUE)
Note that mytable
must already exist, and the schema must be compatible with the data you are loading. At the moment, vertica.dplyr does not automatically scan your file for a schema and create an appropriate table for you.
You can view, check for, and drop tables by using db_list_tables
, db_has_table
, and db_drop_table
, respectively:
db_list_tables(vertica$con)
## [1] "AllstarFull" "asdf" "AwardsManagers"
## [4] "AwardsPlayers" "AwardsShareManagers" "AwardsSharePlayers"
## [7] "bank_original" "compute_test_table" "d1"
## [10] "d2" "Fielding" "FieldingOF"
## [13] "FieldingPost" "flights" "HallOfFame"
## [16] "LahmanData" "Managers" "ManagersHalf"
## [19] "Master" "Pitching" "PitchingPost"
## [22] "ref_join" "Salaries" "Schools"
## [25] "SchoolsPlayers" "SeriesPost" "TeamsFranchises"
## [28] "tempName" "test" "tsempName"
db_has_table(vertica$con,"foobar")
[1] FALSE
db_has_table(vertica$con,"Salaries")
[1] TRUE
db_drop_table(vertica$con,"Salaries")
Queries (tbl objects) can be explained using explain
:
explain(q5)
<SQL>
SELECT *
FROM (SELECT "origin", count(*) AS "count", AVG("arr_delay") AS "delay"
FROM (SELECT *
FROM (SELECT *
FROM (SELECT "year" AS "year", "month" AS "month", "day" AS "day", "origin" AS "origin", "arr_delay" AS "arr_delay"
FROM "flights") "xmoecdranw"
WHERE (("year" = 2013.0) AND ("month" > 1.0) AND ("month" < 12.0))) "rwadqnsown"
WHERE (NOT(("arr_delay") IS NULL))) "uzvummcynv"
GROUP BY "origin") "lcdmrospht"
ORDER BY "delay" DESC
<PLAN>
------------------------------ QUERY PLAN DESCRIPTION: ------------------------------
EXPLAIN SELECT * FROM (SELECT "origin", count(*) AS "count", AVG("arr_delay") AS "delay" FROM (SELECT * FROM (SELECT * FROM (SELECT "year" AS "year", "month" AS "month", "day" AS "day", "origin" AS "origin", "arr_delay" AS "arr_delay" FROM "flights") "xzcgbapklh" WHERE (("year" = 2013.0) AND ("month" > 1.0) AND ("month" < 12.0))) "qzzwbfvten" WHERE (NOT(("arr_delay") IS NULL))) "ckhkkbixgn" GROUP BY "origin") "cmffzzjjhs" ORDER BY "delay" DESC
Access Path:+-SORT [Cost: 282K, Rows: 3] (PATH ID: 1)
| Order: cmffzzjjhs.delay DESC
| +---> GROUPBY HASH (LOCAL RESEGMENT GROUPS) [Cost: 282K, Rows: 3] (PATH ID: 3)
| | Aggregates: max(flights.origin), count(*), sum_float(flights.arr_delay), count(flights.arr_delay)
| | Group By: collation(flights.origin, 'en_US')
| | +---> STORAGE ACCESS for flights [Cost: 280K, Rows: 280K] (PATH ID: 4)
| | | Projection: public.flights_super
| | | Materialize: flights.arr_delay, flights.origin
| | | Filter: (flights.year = 2013.0)
| | | Filter: ((flights.month > 1.0) AND (flights.month < 12.0))
| | | Filter: (NOT (flights.arr_delay IS NULL))
compute
and db_save_view
allow you to save your tbl
objects to tables and views in HP Vertica, respectively. For a given tbl
named foo
:
# Save to a table named "mytable"
compute(foo,name="mytable")
# Save to a view named "myview"
db_save_view(foo,name="myview")
Functions that are not listed here have not been tested, but may be invokable by using the HP Vertica function name from within vertica.dplyr.
HP Vertica Function Name : Name in vertica.dplyr
- lag : lag
- conditional_change_event : changed
- conditional_true_event : isTrue
- ntile : ntile
- lead : lead
- median : median
- row_number : row_number
- first_value : head
- last_value : tail
- exponential_moving_average : exp_mvg_avg
- percentile_cont : percentile_cont
- percentile_disc : percentile_disc
- rank : rank, min_rank
- dense_rank : dense_rank
- percent_rank : percent_rank
- cume_dist : cume_dist
- row_number : row_num
- stddev : sd
- stddev_pop : sd_pop
- var_pop : var_pop
- var_samp : var_samp
- avg : mean
- sum : sum
- min : min
- max : max
- count(*) : n
- regr_slope : lm_slope
- regr_intercept : lm_intercept
- covar_samp : cov
- bit_and : bitwAnd
- bit_or : bitwOr
- bit_xor : bitwXor
For descriptions of and help for these functions, refer to the appropriate section(s) in the HP Vertica User Manual.
The following functions have been tested and verified:
- tbl
- select
- filter
- summarise
- group_by
- arrange
- collect
- mutate
- explain
- compute
- anti_join
- semi_join
- copy_to
- %>%
See the dplyr docs for how these work. Many others may be functional, but have not been verified.
- db_drop_view
- db_save_view
The following is a list of functions that are used to implement many of the functions listed above in dplyr, and are, therefore, exported as part of the package for dispatch in certain generic functions. They may be used, but are not meant to be directly invoked, and are thus not documented for their behavior:
- db_insert_into
- db_save_query
- db_query_fields
- db_query_rows
- db_explain
- db_analyze
- db_create_index
- db_data_type
right_join
andfull_join
are known to not work due to a lack of a tbl_sql implementation indplyr
.
This user guide touches upon most of the useful features in vertica.dplyr.
For more information and help on the functions that vertica.dplyr exports, please refer to their manual pages.
For more literature on dplyr, you may look at the documentation available for it in Hadley Wickham's RStudio pages.