Code Generation and "Short, Fat" Queries
Drill is a query engine for "Big Data" queries: scans of files with billions of records distributed over large clusters. As it turns out, sometimes people want to use Drill for smaller queries: queries that could run just as well in a typical database such as MySQL. The advantage of Drill, however, is the lack of an ETL step: no need to load data into a table. For this reason, people seek to use Drill even on small data files.
We can speed up such use cases by 33% using plain-old Java compilation. We can achieve a 2.5x speedup by also using "pre-written" selection-vector copier code that is optimized for performance.
This write-up discusses work done to analyze the performance of one such query: a "short, fat" query that has just 6000 rows but 240+ columns, mostly float columns representing money amounts.
It turns out we can reduce run-time cost by 33% by using plain-old Java without byte-code fixups, and using the JDK compiler.
The query profile showed that setup time for the selection vector remover operator takes over 50% of the run time (on the first run). Most of that is code generation: code gen, compilation, and byte-code fixup.
To get a clear read, an option was added to turn off the code cache so that the code is rebuilt on each run. Then, the query was run 10 times on the same (embedded) Drillbit with “plain old Java” compilation and 10 times with normal Drill compilation. (“Plain old Java” just means having each generated class extend its template, then using the byte codes directly from the compiler with no ASM-based byte-code manipulation. See DRILL-5052.)
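The “plain old Java” structure can be sketched as follows. This is an illustration only: the class and method names below are hypothetical, not Drill’s actual templates, but the shape is the same — fixed driver logic lives in a template, and the generated class merely subclasses it, so the compiler’s byte codes can be used as-is with no ASM rewriting.

```java
// Hypothetical sketch of the "plain-old Java" approach.
abstract class CopierTemplate {
    // Fixed driver logic lives in the template.
    public int copyRecords(int[] selection, int count) {
        for (int i = 0; i < count; i++) {
            doCopy(selection[i], i);  // the generated subclass fills this in
        }
        return count;
    }
    protected abstract void doCopy(int inIndex, int outIndex);
}

// What the code generator would emit (illustrative only).
class GeneratedCopier extends CopierTemplate {
    private final int[] inVector;
    private final int[] outVector;

    GeneratedCopier(int[] in, int[] out) {
        this.inVector = in;
        this.outVector = out;
    }

    @Override
    protected void doCopy(int inIndex, int outIndex) {
        outVector[outIndex] = inVector[inIndex];
    }
}
```

Because `GeneratedCopier` is an ordinary subclass, whichever Java compiler runs produces byte codes that the JVM can load directly; the ASM-based merge of template and generated code becomes unnecessary.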
Results:
Traditional: 3259 ms
Straight-Java: 2488 ms
Savings: 1041 ms (30%)
Even more interesting, we can time the actual code-gen and compile step. Virtually the entire cost is the byte-code fixup. Using the traditional approach, the cost is about 1500 ms per code-gen/compile/byte-code-fixup cycle. Using plain-old Java, the cost drops 90%, to about 150 ms.
Next we can go all-in and convert all operators in this particular query to use “plain-old Java”. This reduced run time further, to 2232 ms, for another 250 ms savings (about 10%).
Going against conventional wisdom a bit more, we can select the JDK compiler instead of Janino. Result: 2183 ms, another 50 ms improvement. Not big, but it shows that the JDK compiler is not bad: perhaps because the JDK compiler does not write code to a file before compiling.
The net result is an improvement of 1076 ms per run, a 33% reduction.
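For reference, the compiler choice is exposed as a session or system option. The option name below is taken from the Drill configuration docs; verify it against your Drill version before relying on it:

```sql
-- Switch generated-code compilation from Janino to the JDK compiler
-- (option name per Drill documentation; check your version).
ALTER SESSION SET `exec.java_compiler` = 'JDK';
```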
As a sanity check, code caching was enabled again. The 10-run average is now 1817 ms. (The original 10-run average time, with caching and “stock” settings, was 2124 ms.)
After these changes, selection vector remover setup accounts for 23% of total time and the JSON scan for 32%: an improvement, but one that still leaves opportunities for further speed-ups.
The savings above will be less for queries with fewer columns. Of course the flip side is that the savings will be greater for more columns.
“That’s all fine,” you may say, “but we do byte-code fixup for a reason!” To check this, we can use a 2M-row, 300 MB data set and try out the original external sort. With original settings, the 10-run average is 15.2 seconds. Switching to “plain-old Java” and the JDK compiler also produced an average run time of 15.2 seconds; any difference between the two cases is lost in the noise.
It turns out that doing scalar replacement in byte-code fixup saves nothing with modern Oracle JVMs: the JVM’s JIT compiler already does the scalar replacement for us.
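The kind of rewrite the fixup performed can be illustrated with a holder object. The code below is illustrative, not Drill code: byte-code fixup used to replace the holder with plain scalars; a modern JIT’s escape analysis performs the equivalent scalar replacement on its own, so the allocation never reaches the heap in hot code.

```java
// A "holder" object of the sort generated code uses to pass values around.
class IntHolder {
    int value;
}

class ScalarReplacementDemo {
    static long sum(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            IntHolder holder = new IntHolder();  // candidate for scalar replacement
            holder.value = i;
            total += holder.value;               // holder never escapes the loop body
        }
        return total;
    }
}
```

Because `holder` never escapes the loop body, the JIT can prove the object is thread-local and replace its field with a register, which is exactly what the byte-code fixup did by hand.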
Net takeaway: using plain-old Java (without byte code fixup) and the JDK compiler produces better performance for queries with many columns, and speeds up code generation. At the same time, the technique does not degrade performance for queries with large numbers of rows.
In the above, the selection vector remover accounts for a significant amount of the time. We can optimize this by using “generic” code with loops rather than custom-generated code that attempts to unroll the loop. It turns out that, at least for large column counts, the loop version is faster. The total result is that this particular query is now 2.5x faster than on Drill 1.9.
Best previous average: 2183 ms
Instrumenting actual compile time, two “long poles” stood out:
Compile time for org.apache.drill.exec.test.generated.CopierGen3: 500 ms
Compile time for org.apache.drill.exec.test.generated.CopierGen8: 1659 ms
That is quite a bit of time for a 5-second query! Much of the initial run time of 5578 ms is taken up in compiling two classes (2159 ms).
The classes themselves are very simple: they create member variables for 486 vectors (2 × the column count) and call a method on each to do the copy. The only type-specific work is the member variable and the call to the (non-virtual) CopyFrom or CopyFromSafe method. The generated class can easily be replaced by a “generic” class, with virtual functions in the vector classes choosing the correct copy method.
Clearly, avoiding code gen avoids the compile time, for a first-run savings. Here are the last 8 runs (out of 10), with the code cache turned off (forcing a compile on each query run), with and without the generic versions:
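The generic replacement can be sketched as follows. The interface and method names here are illustrative, not Drill’s actual vector API: instead of ~500 generated member variables and per-type copy calls, the copier holds the vectors in an array and lets a virtual method on each vector pick the right copy for its type.

```java
// Illustrative sketch of a loop-based "generic" copier.
interface ValueVec {
    // Each vector type knows how to copy one of its own entries.
    void copyEntry(int toIndex, ValueVec from, int fromIndex);
}

class IntVec implements ValueVec {
    final int[] data;

    IntVec(int size) { data = new int[size]; }

    @Override
    public void copyEntry(int toIndex, ValueVec from, int fromIndex) {
        data[toIndex] = ((IntVec) from).data[fromIndex];
    }
}

class GenericCopier {
    private final ValueVec[] in;
    private final ValueVec[] out;

    GenericCopier(ValueVec[] in, ValueVec[] out) {
        this.in = in;
        this.out = out;
    }

    // One loop over columns replaces hundreds of unrolled, generated statements.
    void copyRow(int fromIndex, int toIndex) {
        for (int col = 0; col < in.length; col++) {
            out[col].copyEntry(toIndex, in[col], fromIndex);
        }
    }
}
```

The virtual `copyEntry` call dispatches to the correct type-specific copy at run time, so no per-query class needs to be generated or compiled at all.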
Original (no code cache): 1832 ms / run
“Pre-written” (no code cache): 1317 ms / run
So, yes, avoiding compilation of generated code saves ~500 ms per run (28%). Not a big surprise.
But the whole reason for generating the code is that we all know 243 in-line statements (an unrolled loop) are faster than a loop with 243 iterations, right? Worse, the generic version uses an array in place of ~500 variables, and a virtual function call rather than in-line, type-specific calls. Can’t be good.
Let’s repeat the exercise, this time with the code cache turned on so that we pay no compile cost for either code path (because we exclude the first two runs, in which the generated code is compiled).
Original: 1302 ms / run
Generic version: 1186 ms / run
So, it seems someone forgot to tell the JVM that an unrolled loop is faster. In this instance, the array/loop/virtual-function version is ~110 ms (9%) faster.
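The two shapes being compared look like this (illustrative only; a real measurement needs a proper harness such as JMH). The point is that the JIT unrolls and vectorizes hot loops on its own, so the hand-unrolled form buys nothing:

```java
class CopyShapes {
    // The hand-unrolled shape that code generation produces:
    // one statement per column, no loop.
    static void unrolled(int[] src, int[] dst) {
        dst[0] = src[0];
        dst[1] = src[1];
        dst[2] = src[2];
        dst[3] = src[3];
    }

    // The loop the "generic" version uses. The JIT unrolls and inlines
    // hot loops itself, which is why this form keeps up (or wins).
    static void looped(int[] src, int[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = src[i];
        }
    }
}
```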
So, we can simplify the code, eliminate a costly code-gen and compile step, and still go faster. Plus, since we’re removing generated classes from the code cache, there is more room for the remaining classes, which may improve the hit rate.
Getting serious, we can rewrite the new vector methods to avoid two levels of function calls. Average time drops to 1040 ms per run, another 12% savings (excluding the first two runs, as above). So, even with a modern JVM, avoiding a method call in an inner loop is a “good thing.”
This now brings the total improvement on the "short, fat" query to:
Original: 3259 ms / run
Revised: 1280 ms / run
(These numbers do include the first two runs in the averages.) The net improvement is now 61%. The query is now 2.5x faster than when we started.
Total setup time is now less than 4% of the run time (down from ~60% originally). The JSON scan now takes 67% of the time. That pesky selection vector remover is still 21% of run time. This query could go faster by skipping the 6000-row SV remover completely, but that is another story.
This exercise focused only on the sv2 and sv4 “copiers” for the selection vector remover. The same trick may not work elsewhere, and certainly won’t work for code that contains user-defined expressions.
DRILL-5125 requests that these improvements be productized.