UDFs Background Information
Drill provides documentation about how to create a User Defined Function (UDF). The information is procedural and walks you through the steps. While this is a great start, some people would like to know what is happening "behind the scenes." That is the topic of this page.
To avoid excessive duplication, this page assumes you are familiar with the existing documentation. We'll touch on some sections to offer simpler alternates, but mostly count on the Drill documentation for the basics of setting up a Maven project, etc.
At first glance, Drill UDFs appear to have an odd structure. After all, Java supports functions and that is all a UDF is, really. But, it seems that Drill UDFs evolved from Hive UDFs, then the design was adjusted to fit the code generation model used within Drill's own operators. The result is a complex interface unique to Drill.
A drawback of Drill's interface is that UDFs are very hard to unit test. (We all unit test our code before bolting it onto Drill, don't we? Good, I thought so.)
The documentation explains how to create a project external to Drill to hold your UDF. This is certainly the form you want to use once your code works. But, to debug your UDF, and to look at the source code referenced here, we have to use an alternative structure temporarily.
Drill provides no API in the normal sense. Instead, Drill provides all sources (it is open source.) Drill assumes that each developer (of a UDF, or storage plugin, etc.) will use the sources needed for that project.
Drill also provides testing tools that we will want to use. But, because of the way Drill has been set up to work with Maven, those tools are available only if your code lives within Drill's java-exec package. (Yes, Drill could use some work to improve its API. Volunteers?)
So, for our function, we will create the following new Java package within java-exec: org.apache.drill.contrib.udfExample. Here is how:
- Download and build Drill as explained in the documentation.
- Using your favorite Git tool, create a new branch from master called udf-example.
- Use mvn clean install to build Drill from sources.
- Load Drill into your favorite IDE (IntelliJ or Eclipse).
- Within drill-java-exec, under src/main/java, create the org.apache.drill.contrib.udfExample package.
- Within drill-java-exec, under src/test/java, also create the org.apache.drill.contrib.udfExample package.
Why have we done this? So we can now follow good test-driven development (TDD) practice and start with a test. Let's deviate from TDD a bit and create a test that passes using the test framework.
In Eclipse:
- Select the test package we just created.
- Choose New → JUnit Test Case.
- Name: ExampleUdfTest.
- Superclass: org.apache.drill.test.ClusterTest.
- Click Finish.
You now have a blank test case. We need to do two things to get started.
First, we must put the Apache copyright at the top of the file. Just pick any other Java file in Drill and copy the copyright notice. (If you forget to do that, Drill's build will fail when next you build from Maven.)
Then, we need a magic bit of code that will start an embedded Drillbit for us. (Later we may want to set config options as shown in org.apache.drill.test.ExampleTest, but for now we'll use the defaults.)
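import org.apache.drill.test.BaseDirTestWatcher;
import org.apache.drill.test.ClusterFixture;
import org.apache.drill.test.ClusterTest;
import org.junit.BeforeClass;
import org.junit.ClassRule;

public class ExampleUdfTest extends ClusterTest {
  @ClassRule
  public static final BaseDirTestWatcher dirTestWatcher = new BaseDirTestWatcher();

  @BeforeClass
  public static void setup() throws Exception {
    startCluster(ClusterFixture.builder(dirTestWatcher));
  }
}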
(The need for the dirTestWatcher may be removed in an upcoming commit; you can use the one in a superclass.)
Next, let's create a demo test:
@Test
public void demoTest() {
  String sql = "SELECT * FROM `cp`.`employee.json` LIMIT 3";
  client.queryBuilder().sql(sql).printCsv();
}
Run this test as a JUnit test and verify that it does, in fact, print three lines of output. If so, you have verified that you have a working Drill environment. Also, we now have a handy fixture to use to exercise our UDF as we build it.
Drill implements each UDF as a class. How does Drill know which classes are UDFs? By using a Drill-defined annotation and implementing a Drill-defined interface. Following the documentation, let's explain this in the context of an example function that fills a gap in Drill's math and trig functions: the sin function:
import org.apache.drill.exec.expr.DrillSimpleFunc;
import org.apache.drill.exec.expr.annotations.FunctionTemplate;
import org.apache.drill.exec.expr.annotations.FunctionTemplate.FunctionScope;
import org.apache.drill.exec.expr.annotations.FunctionTemplate.NullHandling;
@FunctionTemplate(
name = "sin",
scope = FunctionScope.SIMPLE,
nulls = NullHandling.NULL_IF_NULL)
public class SinFunction implements DrillSimpleFunc {
Here we immediately see a benefit of working with the Drill project: we can traverse to the source for these various items, including the annotation. Fortunately, the documentation for FunctionTemplate has some very helpful explanations.
For now, let's use the scope and nulls values above; we'll discuss them at length later.
The implementation class name must follow Java naming conventions and cannot be the same as an existing class. So, we can't name our class "sin". Instead, the name of the class is independent of the function name. We set the function name using the name argument of the FunctionTemplate annotation.
Function names need to follow very few rules:
- Cannot be blank or null.
- Should be a valid SQL identifier that does not duplicate a SQL keyword, Drill built-in function or UDF you have imported.
- Otherwise, you must escape the name. (Quoting does not help, however, if your function name duplicates another.)
The third rule above says we can name our function something like ++ as long as we quote it:
SELECT `++`(1) FROM VALUES(1);
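The naming rules above can be sketched as a small helper. This is only an illustration: the FunctionNames class, its identifier pattern, and its three-word reserved list are all hypothetical simplifications, not Drill's actual parser rules or keyword list.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

class FunctionNames {
  // Simplifying assumption: a "plain" identifier is letters, digits and
  // underscores, and does not start with a digit.
  private static final Pattern PLAIN = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

  // Tiny illustrative sample only; the real keyword list is far longer.
  private static final Set<String> RESERVED = Set.of("SELECT", "FROM", "WHERE");

  // True if the name must be wrapped in back-ticks when used in a query.
  static boolean needsEscape(String name) {
    if (name == null || name.trim().isEmpty()) {
      throw new IllegalArgumentException("Function name cannot be blank or null");
    }
    return !PLAIN.matcher(name).matches()
        || RESERVED.contains(name.toUpperCase(Locale.ROOT));
  }

  // Renders the name as it would appear in SQL text.
  static String inSql(String name) {
    return needsEscape(name) ? "`" + name + "`" : name;
  }

  public static void main(String[] args) {
    System.out.println(inSql("sin"));
    System.out.println(inSql("++"));
  }
}
```

Note that escaping only handles the syntax problem. As the rules say, it does nothing for a name that collides with another function.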
Debugging UDFs can often be a black box. Sometimes things work and sometimes they don't. It can be hard to know where to look for the problem. One way to reduce the frustration is to test early and often. It both verifies our code and builds our confidence that we are, in fact, on the right path.
Here we will test the annotation just created. This lets us look at the function the way Drill does.
@Test
public void testAnnotation() {
  Class<? extends DrillSimpleFunc> fnClass = SinFunction.class;
  FunctionTemplate fnDefn = fnClass.getAnnotation(FunctionTemplate.class);
  assertNotNull(fnDefn);
  assertEquals("sin", fnDefn.name());
  assertEquals(FunctionScope.SIMPLE, fnDefn.scope());
  assertEquals(NullHandling.NULL_IF_NULL, fnDefn.nulls());
}
The code grabs the class we just created, fetches the annotation, and verifies the three values we set. You can use a variation on this theme to use your debugger (or print statements) to look at the annotation fields.
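If you would rather dump every annotation field at once than assert on them one by one, plain Java reflection can do it. The sketch below is self-contained, so it uses a hypothetical stand-in annotation (DemoTemplate) in place of FunctionTemplate. The members() helper is also made up for illustration, but it should work on any annotation that is retained at run time.

```java
import java.lang.annotation.Annotation;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for FunctionTemplate; only the shape matters here.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface DemoTemplate {
  String name();
  String scope();
  String nulls();
}

@DemoTemplate(name = "sin", scope = "SIMPLE", nulls = "NULL_IF_NULL")
class DemoSinFunction { }

class AnnotationDump {
  // Collects every member of an annotation as name -> value, sorted by name.
  static Map<String, Object> members(Annotation ann) {
    Map<String, Object> result = new TreeMap<>();
    for (Method m : ann.annotationType().getDeclaredMethods()) {
      try {
        result.put(m.getName(), m.invoke(ann));
      } catch (ReflectiveOperationException e) {
        throw new IllegalStateException("Cannot read " + m.getName(), e);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    DemoTemplate ann = DemoSinFunction.class.getAnnotation(DemoTemplate.class);
    members(ann).forEach((k, v) -> System.out.println(k + " = " + v));
  }
}
```

In a Drill test you would pass the FunctionTemplate instance obtained from your function class instead of the stand-in.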
In general, a UDF is a function of zero or more arguments that returns a single value:
y = sin(x)
Although Drill is schema-free at the level of input files, it turns out Drill is strongly typed internally. As a result, the arguments and return value above must have a declared type. (We'll see later how Drill matches types between your function and the Drill columns stored in value vectors.) So, we really need something more like:
double y = sin(double x)
Systems such as JavaScript or Groovy use introspection to get the information directly from Java.
Drill uses introspection, but with hints based on a set of Drill-defined annotations. For example:
import org.apache.drill.exec.expr.annotations.Output;
import org.apache.drill.exec.expr.annotations.Param;
import org.apache.drill.exec.expr.holders.Float8Holder;

public class SinFunction implements DrillSimpleFunc {
  @Param public Float8Holder x;
  @Output public Float8Holder out;
The @Param annotation declares the parameters in the order they are passed to a function. (We'll see a more advanced example later.) The @Output annotation declares the return value (the output).
The above may seem a bit odd: why are we declaring fields in the class to pass in values to a function? Two reasons.
First, the above structure is overkill for a true function such as this one, but is necessary when we look at aggregate functions.
Second, Drill generates Java code to call each function. Presumably this structure is simpler than a true function call because of the way that Drill optimizes function calls. (More on this topic later also.)
For now, let's just remember to use the argument structure.
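To make the idea of "introspection with annotation hints" concrete, here is a self-contained sketch that finds a function's parameter and output fields by their annotations. Everything in it is a stand-in: DemoParam, DemoOutput, DemoFloat8Holder and the FieldScan helper are hypothetical names, and this is not a claim about how Drill itself reads your class.

```java
import java.lang.annotation.Annotation;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the @Param and @Output marker annotations.
@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.FIELD) @interface DemoParam { }
@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.FIELD) @interface DemoOutput { }

// Stand-in for Float8Holder: just a mutable box around a double.
class DemoFloat8Holder { public double value; }

class DemoSin {
  @DemoParam public DemoFloat8Holder x;
  @DemoOutput public DemoFloat8Holder out;
}

class FieldScan {
  // Lists "name: type" for every field that carries the given marker annotation.
  static List<String> annotated(Class<?> cls, Class<? extends Annotation> marker) {
    List<String> found = new ArrayList<>();
    for (Field f : cls.getDeclaredFields()) {
      if (f.isAnnotationPresent(marker)) {
        found.add(f.getName() + ": " + f.getType().getSimpleName());
      }
    }
    return found;
  }

  public static void main(String[] args) {
    System.out.println("params = " + annotated(DemoSin.class, DemoParam.class));
    System.out.println("output = " + annotated(DemoSin.class, DemoOutput.class));
  }
}
```

One caveat: the Java specification does not guarantee the order in which getDeclaredFields() returns fields, so a reflection-only approach like this one could not be relied on to recover parameter order for a function with several arguments.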
The arguments in the example are declared as public, but those in the Drill example default to protected. Which is right? As it turns out, either is fine: Drill never actually uses your compiled code. (Again, more on this later.) We have marked them public so we can more easily create unit tests.
Next we note that the arguments are something called a Float8Holder rather than a Java double. The reason for this is three-fold (which we will explore deeper later):
- The holder structure is convenient for code generation.
- The holders can store not just the value but also whether the value is null.
- Some types (such as VARCHAR) require more than just a simple value.
Different holder types exist for each Drill data type and cardinality (nullable, non-nullable or repeated). Here is the (abbreviated) definition of the Float8Holder:
public final class Float8Holder implements ValueHolder {
  public static final MajorType TYPE = Types.required(MinorType.FLOAT8);
  public static final int WIDTH = 8;
  public double value;
The class tells us Drill's internal notation for a required (that is, "non-nullable") FLOAT8. It tells us that the data values are 8 bytes wide. And, most importantly, it gives us the value as a Java double. (There are no getters or setters for the value; code generation does not use them.)
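To see why holders go beyond a bare double, here is a rough, self-contained sketch of the other two reasons listed above: carrying a null flag, and describing a variable-width value. The Demo classes are stand-ins written for this page. The field names (isSet, start, end, buffer) follow my recollection of Drill's nullable and VARCHAR holders, and a plain byte[] replaces Drill's own buffer class, so check the real sources before relying on any of it.

```java
import java.nio.charset.StandardCharsets;

// Stand-in modeled on a nullable holder: a value plus a "was it set?" flag.
class DemoNullableFloat8Holder {
  public int isSet;      // 1 = value present, 0 = SQL NULL
  public double value;
}

// Stand-in modeled on a variable-width holder: a slice of a shared byte buffer.
class DemoVarCharHolder {
  public int start;      // offset of the first byte
  public int end;        // offset one past the last byte
  public byte[] buffer;  // assumption: byte[] used here only to keep the sketch runnable
}

class HolderDemo {
  // Renders a nullable holder, honoring the null flag.
  static String describe(DemoNullableFloat8Holder h) {
    return h.isSet == 0 ? "NULL" : Double.toString(h.value);
  }

  // Builds a holder that points at bytes [start, end) of the given text.
  static DemoVarCharHolder slice(String text, int start, int end) {
    DemoVarCharHolder h = new DemoVarCharHolder();
    h.buffer = text.getBytes(StandardCharsets.UTF_8);
    h.start = start;
    h.end = end;
    return h;
  }

  // Decodes only the bytes the holder points at.
  static String text(DemoVarCharHolder h) {
    return new String(h.buffer, h.start, h.end - h.start, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    DemoNullableFloat8Holder n = new DemoNullableFloat8Holder();
    System.out.println(describe(n));
    System.out.println(text(slice("hello world", 6, 11)));
  }
}
```

The point of the slice form is that a string value can be read in place from a larger buffer without first copying it into a Java String.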
So, this looks pretty simple: we get our input value from x.value and we put our return value into out.value. Not quite as easy as using Java semantics, but not hard.
Back when we asked the IDE to create our function class, it created two empty methods:
@Override
public void setup() { }

@Override
public void eval() { }
Because we are writing a simple function (one value in, one value out), we can ignore the setup() method for now. We will instead focus on the one that does the real work: eval(). Let's implement our sin function:
@Override
public void eval() {
  out.value = Math.sin(x.value);
}
We cheated: we just let Java do the real work.
Next we go back to our test class and add a test for the function itself, calling it as Drill does (again, this is not really what Drill does, but hold onto that thought):
@Test
public void testFn() {
  SinFunction sinFn = new SinFunction();
  sinFn.setup();
  sinFn.x = new Float8Holder();
  sinFn.out = new Float8Holder();
  sinFn.x.value = Math.PI/2;
  sinFn.eval();
  assertEquals(1.0D, sinFn.out.value, 0.001D);
}
The above is perfectly fine, but tedious. What if we want to test ten different values? To make life easier, we can add test-only methods:
import com.google.common.annotations.VisibleForTesting;
...
@VisibleForTesting
public static SinFunction instance() {
  SinFunction fn = new SinFunction();
  fn.x = new Float8Holder();
  fn.out = new Float8Holder();
  fn.setup();
  return fn;
}

@VisibleForTesting
public double call(double x) {
  this.x.value = x;
  eval();
  return out.value;
}
Our test now becomes much simpler:
@Test
public void testFn() {
  SinFunction sinFn = SinFunction.instance();
  assertEquals(0D, sinFn.call(0), 0.001D);
  assertEquals(1.0D, sinFn.call(Math.PI/2), 0.001D);
  assertEquals(0, sinFn.call(Math.PI), 0.001D);
  assertEquals(-1.0D, sinFn.call(3 * Math.PI/2), 0.001D);
}
Much better: we can now easily test all interesting cases.
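Once the call() helper exists, the list of cases can live in a table instead of a run of assertions. The sketch below shows one possible shape. It is self-contained, so DemoHolder and DemoSinFn are hypothetical stand-ins for Float8Holder and SinFunction, and the SinTable class is invented for illustration.

```java
// Stand-in for Float8Holder.
class DemoHolder { public double value; }

// Stand-in for SinFunction, using the same field-based calling convention.
class DemoSinFn {
  public DemoHolder x = new DemoHolder();
  public DemoHolder out = new DemoHolder();

  public void eval() { out.value = Math.sin(x.value); }

  // Test-only convenience: load the input, evaluate, return the output.
  public double call(double arg) {
    x.value = arg;
    eval();
    return out.value;
  }
}

class SinTable {
  // Each row is {input, expected output}.
  static final double[][] CASES = {
    {0, 0},
    {Math.PI / 2, 1},
    {Math.PI, 0},
    {3 * Math.PI / 2, -1},
    {-Math.PI / 2, -1},
  };

  // Returns how many rows differ from the expected value by more than the tolerance.
  static int failures(double tolerance) {
    DemoSinFn fn = new DemoSinFn();
    int failed = 0;
    for (double[] row : CASES) {
      if (Math.abs(fn.call(row[0]) - row[1]) > tolerance) {
        failed++;
      }
    }
    return failed;
  }

  public static void main(String[] args) {
    System.out.println("failures: " + failures(0.001));
  }
}
```

In a real JUnit test, the loop body would be an assertEquals against the actual SinFunction, so that a failure reports the offending input.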
(If you are following along, you should now experience the beauty of this form of testing: we are always just seconds away from running our next test.)
The next step is to test the function with Drill itself. Because our code is within Drill, and we are using a test framework that starts the server, we need only add a test: