Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRILL-5126: Provide simplified, unified "cluster fixture" for test #710

Closed
wants to merge 1 commit into from

Conversation

paul-rogers
Copy link
Contributor

Drill provides a robust selection of test frameworks that have evolved to satisfy the needs of a variety of test cases. However, some do some of what a given test needs, while others to other parts. Also, the various frameworks make assumptions (in the form of boot-time configuration) that differs from what some test may need, forcing the test to start, then stop, then restart a Drillbit - an expensive operation.

Also, many ways exist to run queries, but they all do part of the job. Several ways exist to change runtime options.

This checkin shamelessly grabs the best parts from existing frameworks, adds a fluent builder facade and provides a complete, versitie test framework for new tests. Old tests are unaffected by this new code.

An adjustment was made to allow use of the existing TestBuilder mechanism. TestBuilder used to depend on static members of BaseTestQuery. A "shim" allows the same code to work in the old way for old tests, but with the new ClusterFixture for new tests.

Details are in the org.apache.drill.test.package-info.java file.

This commit modifies a single test case, TestSimpleExternalSort, to use the new framework. More cases will follow once this framework itself is committed.

Also, the framework will eventually allow use of the extended mock data source from SQL. However, that change must await checkin of the mock data source changes.

@@ -458,6 +458,12 @@
<artifactId>httpdlog-parser</artifactId>
<version>2.4</version>
</dependency>
<dependency>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix the indentation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

* @return this builder
*/
public FixtureBuilder maxParallelization(int n) {
return sessionOption(ExecConstants.MAX_WIDTH_PER_NODE_KEY, n);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should return Builder here and set the sessionOption ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it turns out, we do. sessionOption() is a fluent method to set the session option that returns the builder which we just pass along above.

* Create the embedded Drillbit and client, applying the options set
* in the builder. Best to use this in a try-with-resources block:
* <pre><code>
* FixtureBuilder builder = ClientFixture.newBuilder()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be clusterFixture ? Please correct below as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. A case of the comment not staying current with the code... Also revised the example to show that you have to create both the cluster and client in the current version of the code.

if (builder.bitNames != null && i < builder.bitNames.length) {
name = builder.bitNames[i];
} else {
if (i == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can remove (i == 0) check and always set name as "DEFAULT_BIT_NAME} + Integer.toString(i+1);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could. But this would mean that in 99% of the tests, the name would be "drillbit1" rather than "drillbit". And, to make it easier to add a second Drillbit to an existing test, if there are two this names them "drillbit" and "drillbit2". Can't recall if I did this to be compatible with older test frameworks.

Did clean up the condition and add an explanation.

// Set the default parallelization unless already set by the caller.

if ( ! sawMaxWidth ) {
clientFixture( ).alterSession( ExecConstants.MAX_WIDTH_PER_NODE_KEY, MAX_WIDTH_PER_NODE );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we set this option by default in the sessionOptions List so that we don't have to explicitly check for this ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Didn't originally have session options in the builder. But, now that they exist we should use them...

QueryId queryId = null;
loop:
for ( ; ; ) {
QueryEvent event = listener.get();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can get NPE here since get returns "null" in case of exception.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really? The listener handles errors as follows:

  @Override
  public void submissionFailed(UserException ex) {
    silentPut( new QueryEvent( ex ) );
  }

That is, when an error arrives from the server, we create an event and put it into the event queue, later to be retrieved by the above get. Perhaps show me the error case you discovered as I'm likely missing something here.

I've seen other tests where any failure is delivered using the submissionFailed call...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Next commit will handle this by wrapping the InterruptedException in an event and returning it. That particular exception should never occur in a test case, but no harm in being complete.


public String getScan( ) {
int n = getPlans( ).size();
Pattern p = Pattern.compile( "\\d+-\\d+\\s+(\\w+)\\(" );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we are just trying to find if this plan has Scan keyword after fragment number. Can't we simply use plan.contains("Scan") ? I guess the scan keyword will only be in plans which are actually Scan's.?

Also there can be multiple Scans in the plan right ? In which case we should return list of plans here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. What we should do is provide the JSON version of the query profile in the UI so that customers/users can send us that form instead of the text form. With the JSON form, parsing is much simpler.

Yes, this can be simplified. Originally grabbed the other information such as the numbers, but not now. Also, was being paranoid incase the word "Scan" showed up as a table or field name, but "Scan(" is probably safe.

And, yes, in the general case there are multiple scans. Changed that also.

if ( m.group(2) == null ) {
length = 0;
} else {
length = Integer.parseInt( m.group(2) );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great to add a comment that group(2) is for VARCHAR types which has field length. e.g.: VARCHAR(2147483647)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. There is much to do still here. This is a work in progress that needs expansion as we use it for more test cases.


System.out.println(String.format("Sorted %,d records in %d batches.", recordCount, batchCount));
private void chooseImpl(ClientFixture client, boolean testLegacy) throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the implementation for this be added once the new sort mechanism is checked in ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. This is a placeholder. Once the managed sort and this framework are both committed, then this test will be replaced with one that tests both the original and managed version. This method switches between them.

public Map<Integer,String> getOperators( ) {
Map<Integer,String> ops = new HashMap<>();
int n = getPlans( ).size();
Pattern p = Pattern.compile( "\\d+-(\\d+)\\s+(\\w+)" );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified all the regex with a sample plan from a query profile. Looks good!

@paul-rogers
Copy link
Contributor Author

Rebased onto master. Pulled in additional changes from development branch that have accumulated in the three weeks that this PR has worked its way though the process. See the description in the third commit for details.

appender.setEncoder( ple );
appender.start();

Logger root = (Logger)LoggerFactory.getLogger(Logger.ROOT_LOGGER_NAME);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are checking only for console appender in DrillLogger. Can't root logger have the console appender and drillLogger may not ? In which case we will overwrite the console appender of root logger and while restoring we will remove the console appender from both.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand the semantics, each logger can have multiple appenders. We add ours to any that exist. When we remove ours, we only remove that one, not any others that might be registered. Of course, I still might be missing something...

@Override
public void submissionFailed(UserException ex) {
future.completed(new QuerySummary(queryId, recordCount, batchCount,
System.currentTimeMillis() - startTime, ex));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

formatting

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

* Synchronizes the listener thread and the test thread that
* launched the query.
*/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove extra space here and other places too. Like line 120

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed the other way: consistent white space before & after Javadoc comments.


public boolean failed() { return error != null; }
public boolean succeeded() { return error == null; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

formatting here ?

@sohami
Copy link
Contributor

sohami commented Jan 20, 2017

+1 LGTM

@sudheeshkatkam
Copy link
Contributor

+1, thanks!

Drill provides a robust selection of test frameworks that have evolved to satisfy the needs of a variety of test cases.
However, some do some of what a given test needs, while others to other parts. Also, the various frameworks make
assumptions (in the form of boot-time configuration) that differs from what some test may need, forcing the test
to start, then stop, then restart a Drillbit - an expensive operation.

Also, many ways exist to run queries, but they all do part of the job. Several ways exist to channge
runtime options.

This checkin shamelessly grabs the best parts from existing frameworks, adds a fluent builder facade
and provides a complete, versitie test framework for new tests. Old tests are unaffected by this
new code.

An adjustment was made to allow use of the existing TestBuilder mechanism. TestBuilder used to
depend on static members of BaseTestQuery. A "shim" allows the same code to work in the old
way for old tests, but with the new ClusterFixture for new tests.

Details are in the org.apache.drill.test.package-info.java file.

This commit modifies a single test case, TestSimpleExternalSort, to use the new framework.
More cases will follow once this framework itself is committed.

Also, the framework will eventually allow use of the extended mock data source
from SQL. However, that change must await checkin of the mock data source changes.

Includes a LogFixture that allows setting logger options per test to simplify debugging via tests.

Also includes a “summary listener” to run a query and return a summary of the
run. Handy to simply verify that a query runs and to time it.

Added an async query runner for tests that want to run multiple
concurrent queries.
@paul-rogers
Copy link
Contributor Author

Rebased on master, resolved conflict and squashed commits.

@asfgit asfgit closed this in 5c3924c Jan 30, 2017
vdiravka added a commit to vdiravka/drill that referenced this pull request Jun 7, 2018
* Updated plugin version to 1.7.0-SNAPSHOT

* MD-789: Query with condition involving addition of DATE and INTERVAL types returns no results [MapR-DB JSON Tables]

+ Added `enablePushdown` option to enable/disable all filter pushdown, enabled by default.

* MD-813: Improve count(*) queries against MapR-DB Json tables.

+ Fail query on schema change.
+ Added a configuration option 'ignoreSchemaChange', which when enabled, drops the rows from the result

* Update MapR v5.1.0 artifacts version

* DRILL-4199: Add Support for HBase 1.X

* Updated plugin version to 1.9.0-SNAPSHOT

* Explicitly specify `hbase-server` dependency in Hive storage plugin.

Hive's HBaseStorageHandler uses HBase's TableInputFormat which is in hbase-server module.

* Exclude 'drill-memory-base' and 'tpch-sample-data' jars from jars/3rdparty folder.

* DRILL-4886: Modifying projects POMs to align with Drill's build and distribution.

The default build/test/packaging behavior for mapr-format-plugin module are

1. BUILD of mapr-format-plugin is ENABLED.
2. Unit tests of mapr-format-plugin module are DISABLED (use `-Pmapr` to enable).
3. Packaging of mapr-format-plugin is DISABLED (use `-Pmapr` to enable).

Please see LEGAL-251 for discussion/conclusion regarding inclusion of source code with non-open-source dependency.

* DRILL-4894: Fix unit test failure in 'storage-hive/core' module

Exclude 'hadoop-mapreduce-client-core' and 'hadoop-auth' as transitive dependencies from 'hbase-server'

* DRILL-3898 :  Sort spill was modified to catch all errors, ignore repeated errors while closing the new group and issue a more detailed error message.

close apache/drill#591

* DRILL-4771: Drill should avoid doing the same join twice if count(distinct) exists

close apache/drill#588

* DRILL-4906: CASE Expression with constant generates class exception

Tests for different data types

close apache/drill#598

* DRILL-4911:  Avoid plan serialization in SimpleParallelizer when debug logging is not enabled.

* DRILL-4618: Correct the usage of random flag in Hive function registry

+ Function visitor should not use previous function holder if this function is non-deterministic

closes #509

* DRILL-4862: Fix binary_string to use an injected buffer as out buffer

+ Previously, binary_string used the input buffer as output buffer. So after calling binary_string, the original content was destroyed. Other expressions/ functions that need to access the original input buffer get wrong results.
+ This fix also sets readerIndex and writerIndex correctly for the output buffer, otherwise the consumer of the output buffer will hit issues.

closes #604

* DRILL-4927: Add support for Null Equality Joins

These changes are a subset of the original pull request from DRILL-4539 (PR-462).
- Added changes to support Null Equality Joins;
- Created tests for it.

close apache/drill#603

* DRILL-4452: Uses Apache Calcite Avatica driver vs Optiq driver for Drill JDBC

Drill JDBC driver uses Optiq Avatica as its basis, but this dependency has
been moved over to Calcite, for quite some time without Drill code being
updated for it.

This patch updates Avatica version to the version from Calcite
(1.4.0-drill-r19). It also refactors Drill JDBC driver to comply with the
packages and API changes in Avatica. Finally it fixes the the SQL types for
lists and structs, since Drill doesn't support java.sql.Array and
java.sql.Struct interfaces.

this closes #395

Change-Id: Ia608adf900e8708d9e6f6f58ed41e104321a9914

* DRILL-4880: Support JDBC driver registration using ServiceLoader

Support loading Drill driver using ServiceLoader. From the user perspective,
it means being able to use the driver without registering it first, like by using
Class.forName("org.apache.drill.jdbc.Driver") for example.

this closes #596

Change-Id: Id26922ee42bef5fbce46ac2bcbb83f1859e9bb48

* DRILL-4930: Fix Metadata results ordering

Change MetadataProvider to return metadata results ordered (following
convention used by ODBC and JDBC specs).

this closes #614

Change-Id: Iff59b7fada7040602f1735bccc13bc6bf5c9a252

* DRILL-4925: Add tableType filter to GetTables metadata query

- Adding tableType filter to GetTablesReq query (needed for JDBC and ODBC
drivers).
- Fix table type returned by sys and INFORMATION_SCHEMA tables
- Also fixes some protobuf typos to related classes.

this closes #612

Change-Id: If95246a312f6c6d64a88872936f516308874c2d2

* DRILL-4870 drill-config.sh sets JAVA_HOME incorrectly for the Mac

This closes #605

* DRILL-4203: Fix DrillVersionInfo to make it provide a valid version number even during the unit tests.

This is now a build-time generated class, rather than one that looks on the
classpath for META-INF files.

This pattern for file generation with parameters passed from the POM files
was borrowed from parquet-mr.

* DRILL-4203: Fix date values written in parquet files created by Drill

Drill was writing non-standard dates into parquet files for all releases
before 1.9.0. The values have been read by Drill correctly by Drill, but
external tools like Spark reading the files will see corrupted values for
all dates that have been written by Drill.

This change corrects the behavior of the Drill parquet writer to correctly
store dates in the format given in the parquet specification.

To maintain compatibility with old files, the parquet reader code has
been updated to check for the old format and automatically shift the
corrupted values into corrected ones automatically.

The test cases included here should ensure that all files produced by
historical versions of Drill will continue to return the same values they
had in previous releases. For compatibility with external tools, any old
files with corrupted dates can be re-written using the CREATE TABLE AS
command (as the writer will now only produce the specification-compliant
values, even if after reading out of older corrupt files).

While the old behavior was a consistent shift into an unlikely range
to be used in a modern database (over 10,000 years in the future), these are still
valid date values. In the case where these may have been written into
files intentionally, and we cannot be certain from the metadata if Drill
produced the files, an option is included to turn off the auto-correction.
Use of this option is assumed to be extremely unlikely, but it is included
for completeness.

This patch was originally written against version 1.5.0, when rebasing
the corruption threshold was updated to 1.9.0.

Added regenerated binary files, updated metadata cache files accordingly.

One small fix in the ParquetGroupScan to accommodate changes in master that changed
when metadata is read.

Tests for bugs revealed by the regression suite.

Fix drill version number in metadata file generation

* DRILL-4203: Parquet File. Date is stored wrongly - Added new extra field in the parquet meta info "is.date.correct = true"; - Removed unnecessary double conversion of value with Julian day; - Added ability to correct corrupted dates for parquet files with the second version old metadata cache file as well.

This closes #595

* DRILL-4653: Malformed JSON should not stop the entire query from progressing

This closes #518

* DRILL-4726: Dynamic UDF Support

1) Configuration / parsing / options / protos
2) Zookeeper integration
3) Registration / unregistration / lazy-init
4) Unit tests

This closes #574

* DRILL-3178: csv reader should allow newlines inside quotes

This closes #593

* DRILL-4369: Exchange name and version infos during handshake

There's no name and version exchanged between client and server over the User RPC
channel.

On client side, having access to the server name and version is useful to expose it
to the user (through JDBC or ODBC api like DatabaseMetadata#getDatabaseProductVersion()),
or to implement fallback strategy when some recent API are not available (like
metadata API).

On the server side, having access to the client version might be useful for audit
purposes and eventually to implement fallback strategy if it doesn't require a RPC
version change.

this closes #622

* DRILL-4950: Remove incorrect false condition; consume all empty batches

closes #621

* DRILL-4921: Added -P to "cd" in scripts to allow for symbolic links

close apache/drill#623

* DRILL-4954: Fix printing Strings when `allTextMode` is ON for MapR-DB JSON tables.

close apache/drill#624

* DRILL-4905: Push down the LIMIT to the parquet reader scan to limit the numbers of records read

close apache/drill#597

* DRILL-4560: When new bits register, invoke listeners in ZKClusterCoordinator

close apache/drill#626

Original JIRA title: “ZKClusterCoordinator does not call
DrillbitStatusListener.drillbitRegistered for new bits”

* DRILL-4968: Add column size to ColumnMetadata

Add a column size to ColumnMetadata so that JDBC and ODBC clients share
the implementation and don't have to recompute it client side.

this closes #631

* DRILL-4826: Query against INFORMATION_SCHEMA.TABLES degrades as the number of views
increases

This closes #592

* DRILL-4964: Make Drill reconnect to hive metastore after hive metastore is restarted.

Drill fails to connect to hive metastore after hive metastore is restarted unless drillbits are restarted.

Changes: For methods DrillHiveMetaStoreClient.getAllDatabases() and DrillHiveMetaStoreClient.getAllTables(),
the HiveMetaStoreClient wraps MetaException and TException both into MetaException. In case of connection
failure which is thrown as TException it is difficult to categorize at DrillClient level. The fix is to
close older connection and reconnect in case of these 2 api's. In all other cases proper set of exceptions
are thrown where we can handle each one individually.

close apache/drill#628

* DRILL-4972: Remove setDaemon(true) call in WorkManager.StatusThread

closes #633

* DRILL-4974: Add missing null check in FindPartitionConditions.analyzeCall()

close apache/drill#634

* DRILL-4967: Adding template_name to source code generated using freemarker template.

close apache/drill#629

* DRILL-4884: Fix IOB exception in limit n query when n is beyond 65535.

close apache/drill#584

* DRILL-3423: Initial HTTPD log plugin. Needs tests. Would be good to improve the timestamp and cookies behaviors since we can make those more type specific.

* DRILL-3423: Adding HTTPd Log Parsing functionality including full pushdown, type remapping and wildcard support.
 Pushed through the requested columns for push down to the parser. Added more tests to cover a few more use cases. Ensured that user query fields are now completely consistent with returned values.

* DRILL-3243: Added CSG mods. Fixed field names.
Removed old test files
Added Parse_url and parse_query() functions
Fix unit test

This closes #607

* DRILL-4373: Drill and Hive have incompatible timestamp representations in parquet - added sys/sess option "store.parquet.int96_as_timestamp"; - added int96 to timestamp converter for both readers; - added unit tests;

This closes #600

* DRILL-4420: C++ API for metadata access and prepared statements

Add support to the C++ client for metadata querying and prepared
statement requests.

Part of the metadata API, add methods to query for server capabilities.
As of now, this interface is not backed up by any RPC exchange so
the information is pretty much static, and match Drill 1.8.0
current capabilities.

* DRILL-4853: Update C++ protobuf source files

Add support for prepared statements and metadata querying

* DRILL-1268: Add unit test to C++ native client

Add CppUnit unit test to the C++ native client

* DRILL-1996: Add cancel method to Drill C++ connector

This closes #602

* DRILL-4792: Include session options used for a query as part of the profile

closes #551

* DRILL-4951: Do Guava.patch earlier so we can run single Hbase's unit test through command line or IDE

closes #636

* DRILL-4927 (part 2): Add support for Null Equality Joins (mixed comparators)

This changes are a subset of the original pull request from DRILL-4539 (PR-462).
- Added changes to support mixed comparators;
- Added tests for it.

closes #635

* DRILL-4986: Allow users to customize the Drill log file name

Adds a new user-settable environment variable to choose the log file
base name.

closes #641

* DRILL-4989: Fix TestParquetWriter.testImpalaParquetBinaryAsTimeStamp_DictChange

closes #643

* DRILL-4800: Use a buffering input stream in the Parquet reader

* DRILL-4800: Add AsyncPageReader to pipeline PageRead Use non tracking input stream for Parquet scans. Make choice between async and sync reader configurable. Make various options user configurable - choose between sync and async page reader, enable/disable fadvise Add Parquet Scan metrics to track time spent in various operations

* DRILL-4800: Parallelize column reading. Read/Decode fixed width fields in parallel Decoding var length columns in parallel Use simplified decompress method for Gzip and Snappy decompression. Avoids concurrency issue with Parquet decompression. (It's also faster). Stress test Parquet read write Parallel column reader is disabled by default (may perform less well under higher concurrency)

* DRILL-4800: Various fixes. Fix buffer underflow exception in BufferedDirectBufInputStream. Fix writer index for in64 dictionary encoded types. Added logging to help debug. Fix memory leaks. Work around issues with of InputStream.available() ( Do not use hasRemainder; Remove check for EOF in BufferedDirectBufInputStream.read() ). Finalize defaults. Remove commented code.
Addressed review comments

This closes #611

* DRILL-5400: Fix NPE in corrupt date detection

This closes #646

* DRILL-1950: Initial prototype patch for parquet filter pushdown.

Use three new classes from Adam's patch.

* DRILL-1950: Update parquet metadata cache format to include both min/max and additional column type information.

    Parquet meta cache format change:
    1. include both min/max in ColumnMetaData if column statistics is available,
    2. include precision/scale/repetitionLevel/definitionLevel in ColumnTypeMetaData (precision/scale/definitionLevel is for future use).

* DRILL-1950: Parquet rowgroup level filter pushdown in query planning time.

Implement Parquet rowgroup level filter pushdown. The filter pushdown is performed in
in Drill physical planning phase.

Only a local filter, which refers to columns in a single table, is qualified for filter pushdown.

A filter may be qualified if it is a simple comparison filter, or a compound "and/or" filter consists of
simple comparison filter. Data types allowed in comparison filter are int, bigint, float, double, date,
timestamp, time. Comparison operators are =, !=, <, <=, >, >=. Operands have to be a column of the above
data types, or an explicit cast or implicit cast function, or a constant expressions.

This closes #637

* DRILL-4945: Report INTERVAL exact type as column type name

closes #618

* DRILL-4969: Basic implementation of display size for column metadata

Add a simple implementation of display size, based on the ODBC table
available at: https://msdn.microsoft.com/en-us/library/ms713974(v=vs.85).aspx

closes #632

* DRILL-4823: Fix OOM while trying to prune partitions with reasonable data size

closes #560

* DRILL-4674: Allow casting to boolean the same literals as in Postgres

closes #610

* Add Sudheesh's PGP Key

* DRILL-5007: Dynamic UDF lazy-init does not work correctly in multi-node cluster

closes #650

* DRILL-5009: Skip reading of empty row groups while reading Parquet metadata

+ We will no longer attempt to scan such row groups.

closes #651

* DRILL-5047: When session option is string, query profile is displayed incorrectly on Web UI

closes #655

* DRILL-4935: Allow drillbits to advertise a configurable host address to
Zookeeper

This closes #647

* DRILL-4979: Make port of the DataConnection configurable

This closes #649

* DRILL-4831: Running refresh table metadata concurrently randomly fails with JsonParseException

This closes #653

* DRILL-4980: Upgrading of the approach of parquet date correctness status detection - Parquet writer version is added; - Updated the detection method of parquet date correctness.

This closes #644

* DRILL-4995: Allow lazy init when dynamic UDF support is disabled

This closes #645

* Update version to 1.10.0-SNAPSHOT

* DRILL-4604: Generate warning on Web UI if drillbits version mismatch is detected

close apache/drill#482

* DRILL-5094: Comparator should guarantee transitive attribute.

close apache/drill#675

* DRILL-5086: Fix conversion of min and max values to appropriate data type.

close apache/drill#674

* DRILL-5050: C++ client library has symbol resolution issues when loaded by a process that already uses boost::asio

Build with Boost static libs and drill_boost namespace on mac. Added
readme with instructions

DRILL-5050: Addressed review comments

DRILL-5050: address more review comments

close apache/drill#659

* DRILL-4982: Separate Hive reader classes for different data formats to improve performance.

1, Separating Hive reader classes allows optimization to apply on different classes in optimized ways. This  separation effectively avoid the performance degradation of scan.

2, Do not apply Skip footer/header mechanism on most Hive formats. This skip mechanism introduces extra checks on each incoming records.

close apache/drill#638

* DRILL-5015: Randomly select the drillbit from the list provided by user in connection string
            Note: Improved the connection string validation and error handling during parsing.
                  Added unit test for the new parsing mechanism.

close apache/drill#648

* DRILL-5091: Handle new Java 8 JDBC functions and add missing test config parameter

* JDBC unit test fails to set up test storage plugin config on Java 8
* Cleans up compiler warnings seen in Eclipse.
* Added TODO's based on code reviews

closes #676

* DRILL-5108: Reduce output from Maven git-commit-id-plugin

The git-commit-id-plugin grabs information from Git to display
during a build. It prints many e-mail addresses and other generic
project information. As part of the effort to trim down unit test
output, we propose to turn off the verbose output from this plugin
by default.

closes #680

* DRILL-5113: Upgrade Maven RAT plugin to avoid annoying XML errors

Upgrade to eliminate XML compiler warnings that appear for each
sub-project in the Drill build.

Also fixes a Maven warning about a duplicated version error for
the dependency plugin (version is inherited from root pom.xml).

closes #682

* DRILL-5048: Fix type mismatch error in case statement with null timestamp

closes #657

* DRILL-5044: Fix retry logic to handle VersionMismatchException by not deleting jars in remote UDFs area

closes #669

* DRILL-5085: Update description for dynamic UDFs directories in drill-env.sh and drill-module.conf

closes #672

* DRILL-5112: Fix config in PopUnitTestBase

Tests rely on command-line settings in the pom.xml file. Those
settings are not available when tests are run in Eclipse.
Replicated required settings into the base test class (as in
BaseTestQuery).

closes #681

* DRILL-4812: Fix FileSelection#handleWildCard to use normalized path separator

closes #627

* DRILL-5056: Fix UserException to write full message to log

A case occurred in which the External Sort failed during spilling.
All that was written to the log was:

2016-11-18 ... INFO  o.a.d.e.p.i.xsort.ExternalSortBatch - User Error Occurred

Modified the logging code to provide more information to aid tracking down problems that occur in the field.

closes #665

* DRILL-5119: Update MapR version to 5.2.0.40963-mapr

closes #688

* DRILL-4987: Use ImpersonationUtil to get process user’s groups in RemoteFunctionRegistry

closes #642

* DRILL-5081: Lower logging level for corrupt dates message

* introduced in DRILL-4203

closes #691

* DRILL-4938: Report UserException when constant expression reduction fails

closes #689

* DRILL-5065: Optimize count(*) queries on MapR-DB JSON Tables

In MapR-DB v5.2.0, we enabled '_id' only projection for JSON
tables. Hence, we can now optimize the following queries:

a. count(*) by projecting only the '_id' column.

b. '_id' only projections, including count(_id)

Change the format plugin config parameter name.

Fix setter of config parameter `disableCountOptimization` for drill-maprdb plugin

closes #678

* DRILL-5123: Write query profile after sending final response to client to improve latency

In testing a particular query, I used a test setup that does not write
to the "persistent store", causing query profiles to not be saved. I
then changed the config to save them (to local disk). This produced
about a 200ms difference in query run time as perceived by the client.

I then moved writing the query profile after sending the client the
final message. This resulted in an approximately 100ms savings, as
perceived by the client, in query run time on short (~3 sec.) queries.

close apache/drill#692

* DRILL-5117: Compile error when query a json file with 1000+columns

close apache/drill#686

* DRILL-5098: Improving fault tolerance for connection between client and foreman node.

Adding tries config option in connection string. Improving fault tolerance in Drill client when trying to make first connection with foreman. The client will try to connect to min(tries, num_drillbits) unique drillbits unless a successfull connection is established.

HYGIENE: Refactoring BasicClient::close to call RemoteConnection::close

close apache/drill#679

* DRILL-5051: Fix incorrect computation of 'fetch' in LimitRecordBatch when 'offset' is specified

close apache/drill#662

* DRILL-5032: Drill query on hive parquet table failed with OutOfMemoryError: Java heap space

close apache/drill#654

* DRILL-5052: Option to debug generated Java code using an IDE

Provides a second compilation path for generated code: “plan old Java”
in which generated code inherit from their templates. Such code can be
compiled directly, allowing easy debugging of generated code.

Also show to generate two classes in the External Sort Batch as “plain
old Java” to enable IDE debugging of that generated code. Required
minor clean-up of the templates.

Fixes some broken toString( ) methods in code generation classes
Fixes a variety of small compilation warnings
Adds Java doc to a few classes

Includes clean-up from code review comments.

close apache/drill#660

* DRILL-5159: Drill's ProjectMergeRule should operate on RelNodes with same convention trait.

close apache/drill#705

* DRILL-5127: Revert the fix for DRILL-4831

close apache/drill#718

* DRILL-5121 Fix for memory leak. Changes fieldVectorMap in ScanBatch to a CaseInsensitiveMap

close apache/drill#690

* DRILL-5039: NPE - CTAS PARTITION BY (<char-type-column>)

close apache/drill#706

* DRILL-5116: Enable generated code debugging in each Drill operator

DRILL-5052 added the ability to debug generated code. The reviewer suggested
permitting the technique to be used for all Drill operators. This PR provides
the required fixes. Most were small changes, others dealt with the rather
clever way that the existing byte-code merge converted static nested classes
to non-static inner classes, with the way that constructors were inserted
at the byte-code level and so on. See the JIRA for the details.

This code passed the unit tests twice: once with the traditional byte-code
manipulations, a second time using "plain-old Java" code compilation.
Plain-old Java is turned off by default, but can be turned on for all
operators with a single config change: see the JIRA for info. Consider
the plain-old Java option to be experimental: very handy for debugging,
perhaps not quite tested enough for production use.

close apache/drill#716

* DRILL-4996: Parquet Date auto-correction is not working in auto-partitioned parquet files generated by drill-1.6

- Changed detection approach of corrupted date values for the case, when parquet files are generated by drill:
  the corruption status is determined by looking at the min/max values in the metadata;
- Appropriate refactoring of TestCorruptParquetDateCorrection.

This closes #687

* DRILL-5105: comment out unecessary recursive buffer size check

This closes #715

* DRILL-5152: Enhance the mock data source: better data, SQL access

Provides an enhanced version of the mock data source. See the JIRA
entry for motivation, package-info.java for details of operation.

Revisions suggested by code review

Also includes additional comments and a few more compiler warning
cleanups.

This closes #708

* DRILL-4919: Fix select count(1) / count(*) on csv with header

This closes #714

* DRILL-4558: BSonReader should prepare buffer size as actual need

This closes #696

* DRILL-4868: fix how hive function set DrillBuf.

This closes #695

* DRILL-5172: Display elapsed time for queries in the UI

Displays the elapsed time for running queries and the total duration of completed/failed/cancelled queries in the list of query profiles displayed, and within a query's profile page as well.
The query runtime is  displayed in '[hr] [min] sec'.
e.g. A duration of 25,254,321ms is displayed  7 hr 00 min 54.321 sec

This closes #721

* DRILL-4956: Temporary tables support

close apache/drill#666

* DRILL-5097: Using store.parquet.reader.int96_as_timestamp gives IOOB whereas convert_from works

close apache/drill#697

* DRILL-5104: Foreman should not set external sort memory for a physical plan

Physical plans include a plan for memory allocations. However, the code
path in Foreman replans external sort memory, even for a physical plan.
This makes it impossible to use a physical plan to test memory
configuration.

This change avoids changing memory settings in a physical plan; while
preserving the adjustments for logical plans or SQL queries.

Revised to put a property in the plan itself. Old plans, and those
generated from SQL, will have memory allocations applied. Plans
marked as already "resource management" planned will be used as-is.

Includes a unit test that demonstrates the new behavior.

close apache/drill#703

* DRILL-5164: Equi-join query results in CompileException when inputs have large number of columns.

close apache/drill#711

* DRILL-5218: Support optionally disabling heartbeats from C++ client

closes #726

* DRILL-4764: Parquet file with INT_16, etc. logical types not supported by simple SELECT

closes #673

* DRILL-5126: Provide simplified, unified "cluster fixture" for test

Drill provides a robust selection of test frameworks that have evolved to satisfy the needs of a variety of test cases.
However, some do some of what a given test needs, while others to other parts. Also, the various frameworks make
assumptions (in the form of boot-time configuration) that differs from what some test may need, forcing the test
to start, then stop, then restart a Drillbit - an expensive operation.

Also, many ways exist to run queries, but they all do part of the job. Several ways exist to channge
runtime options.

This checkin shamelessly grabs the best parts from existing frameworks, adds a fluent builder facade
and provides a complete, versitie test framework for new tests. Old tests are unaffected by this
new code.

An adjustment was made to allow use of the existing TestBuilder mechanism. TestBuilder used to
depend on static members of BaseTestQuery. A "shim" allows the same code to work in the old
way for old tests, but with the new ClusterFixture for new tests.

Details are in the org.apache.drill.test.package-info.java file.

This commit modifies a single test case, TestSimpleExternalSort, to use the new framework.
More cases will follow once this framework itself is committed.

Also, the framework will eventually allow use of the extended mock data source
from SQL. However, that change must await checkin of the mock data source changes.

Includes a LogFixture that allows setting logger options per test to simplify debugging via tests.

Also includes a “summary listener” to run a query and return a summary of the
run. Handy to simply verify that a query runs and to time it.

Added an async query runner for tests that want to run multiple
concurrent queries.

closes #710

* DRILL-3562: Query fails when using flatten on JSON data where some documents have an empty array

closes #713

* DRILL-5043: Function that returns a unique id per session/connection similar to MySQL's CONNECTION_ID() #685

* DRILL-5207: Improve Parquet Scan pipelining. Add a configurable AsyncPageReader Queue. Enforce total size of parquet row group. Do not initialize BufferedDirectBufInputStream buffer in init. Wait for first read. Change default size of BufferedDirectBufInputStream. Do not invoke getOptions too many times in Parquet reader. Add metrics for processing time, and decoding time for varlen and fixedlen columns.

This closes #723

* DRILL-5215: CTTAS: disallow temp tables in view expansion logic

This closes #725

* DRILL-5220: Provide API to set application/client names in C++ connector

Add method to DrillClientConfig to set the client and the application names
in the C++ connector.

Allow the ODBC driver (or any user of the C++ connector) to provide more
specific informations like the application using the client.

This closes #728

* DRILL-5224: CTTAS: fix errors connected with system path delimiters (Windows)

This closes #731

* DRILL-5237: FlattenRecordBatch loses nested fields from the schema when returns empty batches for the first time

This closes #735

* DRILL-5238: CTTAS: unable to resolve temporary table if workspace is indicated without schema

This closes #736

* DRILL-5240: Parquet - fix unnecessary object creation while checking for null values in nullable var length columns

This closes #740

* DRILL-5223: Drill should ensure balanced workload assignment at node level in order to get better query performance

This closes #730

* DRILL-5241: JDBC proxy driver: Do not put null value in map

This closes #724

* DRILL-5243: Fix TestContextFunctions.sessionIdUDFWithinSameSession unit test

This closes #743

* DRILL-5219: Relax user properties validation in C++ client

Unlike Java client, C++ client only allows user properties present in a
whitelist. Relax this restriction so that user can add extra properties.

This closes #727

* DRILL-5040: Parquet writer unable to delete table folder on abort

close apache/drill#744

* DRILL-4864: Add ANSI format for date/time functions

DRILL-4864: Add ANSI format for date/time functions(review changes)

close apache/drill#581

* DRILL-5080: Memory-managed version of external sort

Please see JIRA entry for reasons for revision, design spec and list of
changes.

This PR covers the changes to the external sort itself. Tests for this
operator require the test framework in DRILL-5126 and the mock data
source in DRILL-5152. Tests for this operator will be issued as a
separate PR once those two dependencies are committed.

Until then, the new operator is disabled by default. It can be enabled
using drill.sort.external.disable_managed: false.

The operator now spills before receiving a new batch. Revised memory calcs and
merge calcs to make them a bit clearer and provide more margin of error
for the power-of-two allocations used when allocating vectors.

We have two external sort implementations, but only one operator code
for both. They can use only one Metrics enum between them. When adding
new metrics to the new version, didn’t add matching metrics to the old
one. This fixes that issue. (The issue will go away once the old one is
retired.)

Revised memory calculations to reflect limit of 16 MB per vector.
Current revision limits to 16 MB per output batch to be safe. Next
revision will enforce per-vector limits to allow the overall batch to
be larger when possible.

Also simplified the merge-time calculations.

Original code provided only crude methods to learn the size of a record
batch. Adds a "RecordBatchSizer" to provide detailed analysis so the
sort can know the amount of memory used to buffer a batch, the number
of rows, and the expected row width once the rows are copied to a
spill file or the output.

Moved generic spill classes to a separate package.

Created parameters for spill batch size and merge batch size. Separated
these values in code. Deprecated the min, max spill parameters as they
no longer add much value. Minor code rearranging.

Bug fix

Fixes a corner case of merging spilled files in a low-memory condition.

Fixes from code review

close apache/drill#717

* DRILL-5263: Prevent left NLJoin with non scalar subqueries

* DRILL-5230: Translation of millisecond duration into hours is incorrect

Fixed invalid representation of readable elapsed time using `TimeUnit` class in JDK.
e.g. 4545 sec is now correctly translated as `1h15m` instead of `17h15m`

* DRILL-5242: The UI breaks when trying to render profiles having unknown metrics

Skip any metrics whose metric ID is unknown, This prevents any ArrayIndexOutOfBoundsException from being thrown and breaking the UI rendering.

* DRILL-5157: Multiple Snappy versions on class path

Multiple Snappy versions on class path; causes unit test failures.

This fix updates the Snappy library and adds dependency management to
exclude older versions brought in by Avro and Parquet.

* DRILL-5273: CompliantTextReader exhausts 4 GB memory when reading 5000 small files

Please see JIRA for details of problem and fix.

closes #750

* DRILL-5275: Sort spill is slow due to repeated allocations

Rather than create a heap buffer per vector when writing and reading,
the revised code creates a single, shared buffer used for all I/O
within a particular container. This improves performance by reducing GC
and CPU costs during I/Os.

Move I/O buffer, and methods to allocator

Allows the buffer to be shared. Especially in the sort, this is
important, as the sort may have many serializations open at once.

closes #754

* DRILL-5274: Exception thrown in Drillbit shutdown in UDF cleanup code

closes #760

* DRILL-5260: Extend "Cluster Fixture" test framework

- Config option to suppress printing of CSV and other output. (Allows
printing for single tests, not printing when running from Maven.)

- Parsing of query profiles to extract plan and run time information.

- Fix bug in log fixture when enabling logging for a package.

- Improved ZK support.

- Set up the new CTTAS default temporary workspace for tests.

- Clean up persistent storage files on disk to avoid CTTAS startup
failures.

- Provides a set of examples for how to use the cluster fixture.

closes #753

* DRILL-5259: Allow listing a user-defined number of profiles

Allow changing default number of finished queries in web UI, when starting up Drillbits.
Option provided in drill-override.conf (default=100 ; defined in drill-module.conf)
Alternatively, the page can be loaded dynamically for the same.
e.g.
https://<hostname>:8047/profiles?max=100

closes #751

* DRILL-5257: Run-time control of query profiles

Adds a run-time option to save (default) or not save query profiles.

Adds a run-time option to save query profiles in "debug" mode:
that is, after returning the last client response. (Normal mode is
to return the response before writing the profile.)

Tests for normal case are normal unit tests. Tests for debug mode
case are unit tests using the new framework that parse profiles.
The test framework is extended to save query profiles using this
new option.

Modifies the test framework to use the new options when a test
asks to save query profiles.

closes #747

* DRILL-5088: Set client's codec for toJson

closes #702

* DRILL-5196: Init MongoDB cluster when run a single test case directly through command line or IDE

Other fixes include:
+ Sync mongo-java-driver versions to newer 3.2.0
+ update flapdoodle package to latest accordingly

closes #741

* DRILL-5190: Display planning and queued time for a query's profile page

Modified UserSharedBit protobuf for marking planning and wait-in-queue end times. This will allow for accurately reporting the planning, queued and actual execution times of a query.
Planning Time:
In the absence of the planning time's end, for older profiles, the root fragment's (i.e. SCREEN operator) start time is taken as the estimated end of planning time, and as the estimated start time of the execution phase.
QueueWait Time:
We do not estimate the queue time if the planning end time is not available.
Execution Time:
We calculate the execution time based on the availability of these 2 planning time. The computation is done the following way, and reflects a decreasing level of accuracy
1. Execution time = [end(QueueWait) - endTime(Query)]
2. Execution time = [end(Planning) - endTime(Query)]
3. Execution time = [start(rootFragment) - endTime(Query)] - {Estimated}

closes #738

* DRILL-5255: Remove default temporary workspace check at drillbit start up

closes #759

* DRILL-4280: HYGIENE

+ Ignore files generated by IntelliJ in RAT plugin

* DRILL-4280: CORE (Java protocol)

+ Define SaslStatus and SaslMessage messages in protocol
+ Add "authenticationMechanisms" field to all handshakes
+ Add "saslSupport” field to UserToBitHandshake

* DRILL-4280: REFACTOR

+ Extract RemoteConnection interface, and add AbstractRemoteConnection
+ Add ServerConnection and ClientConnection interfaces
+ Add RequestHandler interface to decouple connections from how requests are handled
+ Add NonTransientRpcException

+ Remove unused classes and methods
+ Code style changes

* DRILL-5195: Publish Operator and MajorFragment Stats in Profile page

Improved UI
1. Introduction of Tooltips
2. Share of each operator as a percentages of the major fragment and of the query
  - This would help identify the most CPU intensive operators within a fragment and across the query
3. Rows emitted by each operator
4. For a running query, changes to 'last update' and 'last progress' now shows the elapsed time since.

closes #756

* DRILL-4280: CORE (service login)

+ Support Drillbit login to KDC using Hadoop's
  UserGroupInformation library

+ Set hostname in BootstrapContext
+ Use process user's short name in ImpersonationUtil
+ Add KerberosUtil class

* DRILL-4280: HYGIENE

+ Do not recreate DrillConfig object in PamUserAuthenticator
+ Add new factory method to CaseInsensitiveMap

+ Clean documentation

* DRILL-4280: CORE (security package)

+ Add AuthenticatorFactory interface
+ Kerberos implementation
  + includes SaslServer and SaslClient wrappers
+ Plain implementation
  + PlainServer implements SaslServer (unavailable in Java)
    for username/password based authentication
  + retrofit user authenticator
  + add logic for backward compatibility

+ Add AuthenticatorProvider interface to provide authenticator
  factories, and add two implementations:
  + DrillConfig and ScanResult based AuthenticatorProviderImpl
  + Default and system property based ClientAuthenticatorProvider

+ FastSaslServerFactory caches SaslServer factories
+ FastSaslClientFactory caches SaslClient factories

+ ServerAuthenticationHandler handles authentication on server-side
+ FailingRequestHandler to fail any message received
+ AuthenticationOutcomeListener handles authentication on client-side

security

* DRILL-4280: CORE (user to bit authentication, Java)

+ Add logic for authentication in UserClient and UserServer with
  backward compatibility in both directions
+ Add abstract extension to ServerConnection and ClientConnection
+ Add concrete extensions to abstract connections:
  BitToUserConnection and UserToBitConnection
+ Add ConnectionConfig interface with abstract and concrete
  implementations to encapsulate configuration for server-side
  connections
+ Encapsulate all requests handled by UserServer in
  UserServerRequestHandler

+ Clear UserSession when connection is closed either by user or
  bit
+ Add DrillProperties to encapsulate all connection properties
  used during connection time

* DRILL-4280: CORE (revert DRILL-3242)

+ DRILL-3242 aims to provide offloading request handling to a secondary thread, but this feature is disabled by default due to concurrency issues

+ One of the implications of the feature was to ignore exceptions that were not of UserRpcException type. But exceptions must not be ignored, they should be handled properly, specially in the context of security

* DRILL-4280: CORE (bit to bit authentication, control)

+ Support authentication in ControlServer and ControlClient
+ Add AuthenticationCommand as an initial command after handshake
  and before the command that initiates a connection
+ Add ControlConnectionConfig to encapsulate configuration
+ ControlMessageHandler now implements RequestHandler

control

* DRILL-4280: CORE (bit to bit authentication, data)

+ Support authentication in DataServer and DataClient
+ Add AuthenticationCommand as an initial command after handshake
  and before the command that initiates a connection
+ Add DataConnectionConfig to encapsulate configuration
+ Add DataServerRequestHandler to encapsulate all handling of
  requests to DataServer

data

* DRILL-4280: CORE (web client)

+ Disabled web server when authentication is enabled but PLAIN mechanism
  is not configured; log a warning

* DRILL-4280: CORE (C++ protocol)

* DRILL-4280: CORE (user to bit authentication, C++)

closes #578

* DRILL-4280: CORE (unit tests)

+ Modify existing tests to use new authentication configuration
+ Add TestUserBitKerberos and TestBitBitKerberos using Apache Kerby library

* Bump maxsize of jdbc-all jar to accommodate the increased size of jar file due to new code.

* DRILL-4994: Refactor DrillCursor

Refactor DrillCursor to be more self-contained.

* DRILL-4994: Add back JDBC prepared statement for older servers

When the JDBC client is connected to an older Drill server, it always
attempted to use server-side prepared statement with no fallback.

With this change, client will check server version and will fallback to the
previous client-side prepared statement (which is still limited to only execute
queries and does not provide metadata).

close #613

* DRILL-4730: Update JDBC DatabaseMetaData implementation to use new Metadata APIs

Update JDBC driver to use Metadata APIs instead of executing SQL queries

close #613

* DRILL-5301: Server metadata API

Add a Server metadata API to the User protocol, to query server support
of various SQL features.

Add support to the client (DrillClient) to query this information.

Add support to the JDBC driver to query this information, if the server supports
the new API, or fallback to the previous behaviour (rely on Avatica defaults) otherwise.

close #764

* DRILL-5301: Add C++ client support for Server metadata API

Add support to the Server metadata API to the C++ client if
available. If the API is not supported to the server, fallback
to the previous hard-coded values.

Update the querySubmitter example program to query the information.

close #764

* DRILL-5167: Send escape character for metadata queries

Escape character was not sent when doing metadata queries, which caused
the server to return incorrect results as the pattern is interpreted
differently form what the user asked for.

close #712

* DRILL-5221: Send cancel message as soon as possible in C++ connector

In C++ connector, try to send cancel request to the server as soon as
possible, which means when receiving the queryId or when requested by the
user if queryId has already been received.

close #733

* DRILL-5258: Access mock data definition from SQL

Extends the mock data source to allow using the full power of the mock
data source from an SQL query by referencing the JSON definition
file. See JIRA and package-info for details.

Adds a boolean data generator and a varying-length string generator.

Adds “mock” table stats for use in the planner.

Revisions based on code review comments

close #752

* DRILL-5284: Roll-up of final fixes for managed sort

See subtasks for details.

* Provide detailed, accurate estimate of size consumed by a record batch
* Managed external sort spills too often with Parquet data
* Managed External Sort fails with OOM
* External sort refers to the deprecated HDFS fs.default.name param
* Config param drill.exec.sort.external.batch.size is not used
* NPE in managed external sort while spilling to disk
* External Sort BatchGroup leaks memory if an OOM occurs during read
* DRILL-5294: Under certain low-memory conditions, need to force the sort to merge
two batches to make progress, even though this is a bit more than
comfortably fits into memory.

close #761

* DRILL-5304: Queries fail intermittently when there is skew in data distribution

close #766

* DRILL-4963: Fix issues with dynamically loaded overloaded functions

close #701

* DRILL-5252: Fix a condition that always returns true

close #745

* DRILL-5266: Parquet returns low-density batches

Fixes one glaring problem related to bit/byte confusion.

Includes a few clean-up items found along the way.

Additional fixes from code review comments

More code clean up from code review

close #749

* DRILL-5034: Select timestamp from hive generated parquet always return in UTC

- TIMESTAMP_IMPALA function is reverted to retaine local timezone
- TIMESTAMP_IMPALA_LOCALTIMEZONE is deleted
- Retain local timezone for the INT96 timestamp values in the parquet files while
  PARQUET_READER_INT96_AS_TIMESTAMP option is on

Minor changes according to the review

Fix for the test, which relies on particular timezone

close #656

* DRILL-5290: Provide an option to build operator table once for built-in static functions and reuse it across queries.

close #757

* DRILL-5287: Provide option to skip updates of ephemeral state changes in Zookeeper

close #758

* DRILL-5208: Finding path to java executable should be deterministic

See DRILL-5208 for background. Instead of using “find” to locate the
java command, we use the any information available, resorting to find
only if the “usual suspects” fails. The result is that we use the JDK
java when available, instead of randomly choosing JDK or JRE java.

close #763

* DRILL-5313: Fix compilation issue in C++ connector

DRILL-5301 and DRILL-5167 have conflicting changes, which causes
the C++ connector to not compile: the static symbol for the search
escape string has been removed as the server might use a different one.

Fix the issue by using the current search escape string (injected from the
meta to the internal drill client when querying metadata).

close #769

* DRILL-5293: Change seed for distribution hash function to differ from that of the hash table

close #765

* DRILL-5326: Unit tests failures related to the SERVER_METADTA

 	- adding of the sql type name for the "GENERIC_OBJECT";
	- changing "NullCollation" in the "ServerMetaProvider" to the correct default value;
	- changing RpcType to GET_SERVER_META in the appropriate ServerMethod

close #775

* Update version to 1.11.0-SNAPSHOT

* DRILL-5226: Managed external sort fixes

* Memory leak in managed sort if OOM during sv2 allocation
* "Record batch sizer" does not include overhead for variable-sized
vectors
* Paranoid double-checking of merge batch sizes to prevent OOM when the
sizes differ from expectations
* Revised logging

Addresses review comments

close apache/drill#767

* DRILL-5311: Check handshake result in C++ connector

In C++ client connector, DrillClientImpl::recvHandshake always
return success, even in case of connection error (like a tcp
timeout issue). Only on WIN32 platform would the error code be
checked.

Remove the restriction to only check on WIN32, plus add some logging.

close apache/drill#770

* DRILL-5316: Check drillbits size before we attempt to access the vector element

close apache/drill#772

* DRILL-5330: NPE in FunctionImplementationRegistry

Fixes:

* DRILL-5330: NPE in
FunctionImplementationRegistry.functionReplacement()
* DRILL-5331:
NPE in FunctionImplementationRegistry.findDrillFunction() if dynamic
UDFs disabled

When running in a unit test, the dynamic UDF (DUDF) mechanism is not
available. When running in production, the DUDF mechanism is available,
but may be disabled.

One confusing aspect of this code is that the function registry
is given the option manager, but the option manager is not yet valid
(not yet initialized) in the function registry constructor. So, we
cannot access the option manager in the function registry constructor.

In any event, the existing system options cannot be used to disable DUDF
support. For obscure reasons, DUDF support is always enabled, even when
disabled by the user.

Instead, for DRILL-5331, we added a config option to "really" disable DUDFS.
The property is set only for tests, disables DUDF support.
Note that, in the future, this option could be generalized to
"off, read-only, on" to capture the full set of DUDF modes.
But, for now, just turning this off is sufficient.

For DRILL-5330, we use an existing option validator rather than
accessing the raw option directly.

Also includes a bit of code cleanup in the class in question.

The result is that the code now works when used in a sub-operator unit
test.

close apache/drill#777

* DRILL-5349: Fix TestParquetWriter unit tests when synchronous parquet reader is used.

close apache/drill#780

* DRILL-5352: Profile parser printing for multi fragments

Enhances the recently added ProfileParser to display run times for
queries that contain multiple fragments. (The original version handled
just a single fragment.)

Prints the query in “classic” mode if it is linear, or in the new
semi-indented mode if the query forms a tree.

Also cleans up formatting - removing spaces between parens.

Fixes from review

close apache/drill#782

* Fixed process time percent.
* Added support for getting operator profiles in a multi-fragment query.

* DRILL-5359: Fix ClassCastException when Drill pushes down filter on the output of flatten operator.

- Move findItemOrFlatten as a static method in DrillRelOptUtil.
- Exclude filter conditions if they contain item/flatten operator.

close apache/drill#786

* DRILL-4678: Tune metadata by generating a dispatcher at runtime

main code changes are in Calcite library.
update drill's calcite version to 1.4.0-drill-r20.

close #793

* DRILL-5394: Optimize query planning for MapR-DB tables by caching row counts

close #802

* DRILL-5378: Put more information for schema change exception in hash join, hash agg, streaming agg and sort operator.

close #801

* DRILL-5369: Add initializer for ServerMetaContext

ServerMetaContext had no default constructor. The lack of it
might cause m_done to be set to true, same for other variables.

Add a default constructor to explicitly initialize its members.

close #791

* DRILL-5368: Fix memory leak issue in DrillClientImpl::processServerMetaResult

Fix a small memory leak by doing local allocation instead since the
object doesn't escape the function.

close #790

* DRILL-5351: Minimize bounds checking in var len vectors for Parquet reader

close #781

* DRILL-5297: when the generated plan mismatches, PlanTest print the generated plan along with expected pattern

close #798

* DRILL-4971: Query encounters system error, when there aren't eval subexpressions of any function in boolean and/or expressions

- New evaluated blocks for boolean operators should be with braces always, since they use labels.

close #792

* DRILL-3510: Add ANSI_QUOTES option so that Drill's SQL Parser will recognize ANSI_SQL identifiers

- added supporing of quoting identifiers with DOUBLE_QUOTES or BRACKETS via setting new
  sys/sess EnumString option QUOTING_IDENTIFIERS;
- added possibility of setting QUOTING_IDENTIFIERS by the jdbc connection URL string;
- added relevant unit tests;

close #520

* DRILL-5373: Drill JDBC error in the process of connection via SQuirrel

 -  java.lang.NoClassDefFoundError: javax/validation/constraints/NotNull

* DRILL-5413: DrillConnectionImpl.isReadOnly() throws NullPointerException

change is in CALCITE-843.
update drill's calcite version to 1.4.0-drill-r21

close #806

* DRILL-5375: Nested loop join: return correct result for left join closes #794

* DRILL-5315: Address small typo in the comment in drillClient.hpp closes #771

* DRILL-5355: Misc. code cleanup closes #784

* DRILL-5234: External sort's spilling functionality does not work when the spilled columns contains a map type column closes #799

* DRILL-5395: Query on MapR-DB table fails with NPE due to an issue with assignment logic closes #803

* DRILL-5387: Ignore TestBitBitKerberos and TestUserBitKerberos closes #810

* DRILL-5319: Refactor "contexts" for unit testing closes #787

This PR is purely a refactoring: no functionality is added or changed.
The refactoring splits various context and related classes into a set
of new interfaces with needed for operator-level unit tests. The other,
Drillbit-related methods are left in the original interfaces. Most code
need not change.

The changes here allow operator-level unit tests to mock up the
exec-time methods so we can use them without firing up a Drillbit (or
using mocking libraries).

A later PR will provide the sub-operator test framework that uses this
refactoring.

Changes include:

* The OptionManager is split, with read-only methods moving to a new
OptionSet interface.
* The FragmentContext is split, with an exec-only FragmentExecContext
proving low-level methods.
* OperatorStats is split, with a new OperatorStatReceiver class
providing write-only support to operators.
* Several places that accepted an OperatorContext or FragmentContext,
but needed only an allocator, are changed to accept the allocator
directly.

Includes fixes for code review comments

Adds more comments. Postpones the suggested rename until all affected
code is in master, else it will be difficult to synchronize the rename
across multiple branches.

* DRILL-5213: Prepared statement for actual query is missing the query text

close apache/drill#812

* DRILL-5409 - update MapR version to 5.2.1

close apache/drill#813

* DRILL-5415: Improve Fixture Builder to configure client properties and keep collection type properties for server

Updated with review feedback

close apache/drill#807

* DRILL-5424: Fix IOBE for reverse function

close apache/drill#815

* Test-specific column accessor implementation. Provides a simplified, unified set of access methods for value vectors specifically for wrting simple, compact unit test code.

* Interfaces for column readers and writers
* Interfaces for tuple (row and map) readers and writers
* Generated implementations
* Base implementation used by the generated code
* Factory class to create the proper reader or writer given a major
type (type and cardinality)
* Utilities for generic access, type conversions, etc.

Many vector types can be mapped to an int for get and set. One key
exception are the decimal types: decimals, by definition, require a
different representation. In Java, that is `BigDecimal`. Added get, set
and setSafe accessors as required for each decimal type that uses
`BigDecimal` to hold data.

The generated code builds on the `valueVectorTypes.tdd` file, adding
additional properties needed to generate the accessors.

The PR also includes a number of code cleanups done while reviewing
existing code. In particular `DecimalUtility` was very roughly
formatted and thus hard to follow.

Supports Drill’s interval types (INTERVAL, INTERVALDAY,
INTERVALYEAR) in the form of the Joda interval class.

Adds support for Map vectors. Maps are treated as nested tuples and are
expanded out to create a flattened row in the schema. The accessors
then access rows using the flattened column index or the combined name
(“a.b”).

Supports arrays via a writer interface that appends values as written,
and an indexed, random-access reader interface.

Removed HTTP log parser from JDBC jar to keep the JDBC jar from getting
too big.

close apache/drill#783

* DRILL-5323: Test tools for row sets

Provide test tools to create, populate and compare row sets

To simplify tests, we need a TestRowSet concept that wraps a
VectorContainer and provides easy ways to:

- Define a schema for the row set.
- Create a set of vectors that implement the schema.
- Populate the row set with test data via code.
- Add an SV2 to the row set.
- Pass the row set to operator components (such as generated code
blocks.)
- Examine the contents of a row set
- Compare the results of the operation with an expected result set.
- Dispose of the underling direct memory when work is done.

This code builds on that in DRILL-5324 to provide a complete row set
API. See DRILL-5318 for the spec.

Note: this code can be reviewed as-is, but cannot be committed until
after DRILL-5324 is committed: this code has compile-time dependencies
on that code. This PR will be rebased once DRILL-5324 is pulled into
master.

Handles maps and intervals

The row set schema is refined to provide two forms of schema. A
physical schema shows the nested structure of the data with maps
expanding into their contents.

Updates the row set schema builder to easily build a schema with maps.

An access schema shows the row “flattened” to include just scalar
(non-map) columns, with all columns at a single level, with dotted
names identifying nested fields. This form makes for very simple access.

Then, provides tools for reading and writing batches with maps by
presenting the flattened view to the row reader and writer.

HyperVectors have a very complex structure for maps. The hyper row set
implementation takes a first crack at mapping that structure into the
standardized row set format.

Also provides a handy way to set an INTERVAL column from an int. There
is no good mapping from an int to an interval, so an arbitrary
convention is used. This convention is not generally useful, but is
very handy for quickly generating test data.

As before, this is a partial PR. The code here still depends on
DRILL-5324 to provide the column accessors needed by the row reader and
writer.

All this code is getting rather complex, so this commit includes a unit
test of the schema and row set code.

Revisions to support arrays

Arrays require a somewhat different API. Refactored to allow arrays to
appear as a field type.

While refactoring, moved interfaces to more logical locations.

Added more comments.

Rejiggered the row set schema to provide both a physical and flattened
(access) schema, both driven from the original batch schema.

Pushed some accessor and writer classes into the accessor layer.

Added tests for arrays.

Also added more comments where needed.

Moved tests to DRILL-5318

The test classes previously here depend on the new “operator fixture”.
To provide a non-cyclic checkin order, moved the tests to the PR with
the fixtures so that this PR is clear of dependencies. The tests were
reviewed in the context of DRILL-5318.

Also pulls in batch sizer support for map fields which are required by
the tests.

closes #785

* DRILL-5318: Sub-operator test fixture

This commit depends on:

* DRILL-5323

This PR cannot be accepted (or built) until the above are pulled and
this PR is rebased on top of them. The PR is issued now so that reviews
can be done in parallel.

Provides the following:

* A new OperatorFixture to set up all the objects needed to test at the
sub-operator level. This relies on the refactoring to create the
required interfaces.
* Pulls the config builder code out of the cluster fixture builder so
that configs can be build for sub-operator tests.
* Modifies the QueryBuilder test tool to run a query and get back one
of the new row set objects to allow direct inspection of data returned
from a query.
* Modifies the cluster fixture to create a JDBC connection to the test
cluster. (Use requires putting the Drill JDBC project on the test class
path since exec does not depend on JDBC.)

Created a common subclass for the cluster and operator fixtures to
abstract out the allocator and config. Also provides temp directory
support to the operator fixture.

Merged with DRILL-5415 (Improve Fixture Builder to configure client
properties)

Moved row set tests here from DRILL-5323 so that DRILL-5323 is self
contained. (The tests depend on the fixtures defined here.)

Added comments where needed.

Puts code back as it was prior to a code review comment. The code is
redundant, but necessarily so due to code which is specific to several
primitive types.

closes #788

* DRILL-5391: CTAS: make folder and file permission configurable

close #820

* DRILL-5428: submit_plan fails after Drill 1.8 script revisions

When the other scripts were updated, submit_plan was not corrected.
After Drill 1.8, drill-config.sh consumes all command line arguments,
finds the —config and —site options, removes them, and places the rest
in the new args array.

This PR updates submit_plan to use the new args array.

The fix was tested on a test cluster: we verified that a physical plan
was submitted and ran.

closes #816

* DRILL-5385: Vector serializer fails to read saved SV2

Unit testing revealed that the VectorAccessorSerializable class claims
to serialize SV2s, but, in fact, does not. Actually, it writes them,
but does not read them, resulting in corrupted data on read.

Fortunately, no code appears to serialize sv2s at present. Still, it is
a bug and needs to be fixed.

First task is to add serialization code for the sv2.

That revealed that the recently-added code to save DrillBufs using a
shared buffer had a bug: it relied on the writer index to know how much
data is in the buffer. Turns out sv2 buffers don’t set this index. So,
new versions of the write function takes a write length.

Then, closer inspection of the read code revealed duplicated code. So,
DrillBuf allocation moved into a version of the read function that now
does reading and DrillBuf allocation.

Turns out that value vectors, but not SV2s, can be built from a
Drillbuf. Added a matching constructor to the SV2 class.

Finally, cleaned up the code a bit to make it easier to follow. Also
allowed test code to access the handy timer already present in the code.

closes #800

* DRILL-5344: External sort priority queue copier fails with an empty batch

Unit tests showed that the “priority queue copier” does not handle an
empty batch. This has not been an issue because code elsewhere in the
sort specifically works around this issue. This fix resolves the issue
at the source to avoid the need for future work-arounds.

closes #778

* DRILL-5423: Refactor ScanBatch to allow unit testing record readers

Refactors ScanBatch to allow unit testing of record reader
implementations, especially the “writer” classes.

See JIRA for details.

closes #811

* DRILL-4039: Query fails when non-ascii characters are used in string literals

closes #825

* DRILL-5419: Calculate return string length for literals & some string functions

1. Revisited calculation logic for string literals and some string functions
(cast, upper, lower, initcap, reverse, concat, concat operator, rpad, lpad, case statement,
 coalesce, first_value, last_value, lag, lead).
Synchronized return type length calculation logic between limit 0 and regular queries.

2. Deprecated width and changed it to precision for string types in MajorType.

3. Revisited FunctionScope and splitted it into FunctionScope and ReturnType.
FunctionScope will indicate only function usage in term of number of in / out rows, (n -> 1, 1 -> 1, 1->n).
New annotation in UDFs ReturnType will indicate which return type strategy should be used.

4. Changed MAX_VARCHAR_LENGTH from 65536 to 65535.

5. Updated calculation of precision and display size for INTE…
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants