This is a PrestoDB connector that allows you to query unstructured data from the open source Manta object store or the public cloud Triton Object Storage service.
The Presto Manta Connector does not require Hive and is a fully stand-alone connector.
Installing the Presto Manta Connector involves adding a jar file to your plugins directory, adding a catalog configuration file, and uploading table definition file(s) to Manta. You can find details on the install process in the installation documentation.
- Create a new file: $PRESTO_HOME/etc/catalog/manta.properties
- Within that file add the following:
# Required connector name to indicate that we are using the Manta plugin
connector.name=manta
# Manta configuration properties (optional if defined elsewhere).
# You can define the Manta connection properties via environment variables,
# Java system properties, or within this file below.
# See: https://github.com/joyent/java-manta/blob/master/USAGE.md#parameters
manta.user=my.username
manta.key_id=00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00
manta.max_connections=48
# Schema definition - Each schema is defined by specifying a Manta directory
# path. Within that remote Manta directory path, the plugin will look for
# schema information files like: presto-tables.json
manta.schema.default=/my.username/stor/presto/schema1
manta.schema.test=/my.username/stor/presto/schema2
Tables within a schema are defined in the file presto-tables.json, which is contained within the schema directory. The actual data files for the schema can be located in a different directory path specified within presto-tables.json.
The table definition file presto-tables.json is JSON with comments supported. You can find an example file here.
With each table definition configuration element (see above), you can optionally specify the column names and data types for each row. If the columns are not explicitly specified, the plugin makes a best-effort guess at the data types based on the first row read.
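As a rough illustration of first-row type inference, the Python sketch below guesses a type label for each top-level field of one JSON line. The function name and the type labels are assumptions made for this example; the connector's actual inference rules and type names may differ.

```python
import json


def guess_column_types(first_line: str) -> dict:
    """Guess a type label for each top-level field of one JSON line.

    The labels are illustrative only, not the connector's own type names.
    """
    row = json.loads(first_line)
    types = {}
    for name, value in row.items():
        # bool must be tested before int because bool is a subclass of int.
        if isinstance(value, bool):
            types[name] = "boolean"
        elif isinstance(value, int):
            types[name] = "bigint"
        elif isinstance(value, float):
            types[name] = "double"
        elif isinstance(value, str):
            types[name] = "varchar"
        else:
            # null, arrays, and nested objects are left untyped in this sketch.
            types[name] = "unknown"
    return types


print(guess_column_types('{"id": 7, "name": "a", "score": 1.5, "ok": true}'))
```

A null value in the first row gives this kind of guess nothing to work with, which is one reason to specify columns explicitly when the data allows nulls.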
Input files can be partitioned based on the file path or directory path within Manta. Partitioning is defined per table and uses a configurable scheme of regular expression matching groups to allow for matching portions of a file path.
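To make the idea of regular expression matching groups concrete, here is a minimal Python sketch that pulls partition values out of a file path. The directory layout, the group names, and the function name are hypothetical; the connector's own partition configuration syntax is described in its table definition documentation, not here.

```python
import re

# Hypothetical layout: .../logs/<year>/<month>/<day>/<file>
PARTITION_PATTERN = re.compile(
    r"/logs/(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})/[^/]+$"
)


def partition_values(path: str):
    """Return the partition values captured from a file path, or None."""
    match = PARTITION_PATTERN.search(path)
    return match.groupdict() if match else None


print(partition_values("/my.username/stor/logs/2017/06/01/part-0.json.gz"))
```

Each matching group captures one portion of the path, so a file's partition values can be read from its location alone.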
The following compression formats are supported for data files:
- BZIP2 with the file extension .bz2
- LZ4 with the file extension .lz4
- GZIP with the file extension .gz
- Xerial SNAPPY with the file extension .xsnappy
- Hadoop SNAPPY with the file extension .snappy
- XZ with the file extension .xz
The Hadoop Snappy native libraries can optionally be loaded in order to get better performance with files that have been compressed in the Hadoop-specific Snappy format.
Currently, the only supported data format is newline-delimited JSON, where each line is a JSON object that is identical in structure to the others, with no missing nodes. In future versions, JSON parsing will become more flexible and other data formats, such as CSV and Parquet, will be supported.
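Because every line must share one structure, it can help to check a file before uploading it. The following Python sketch is one possible pre-upload check, not part of the connector; the function names are made up for this example, and it compares object keys only (it does not look inside arrays or compare value types).

```python
import json


def structure(value):
    """Describe the shape of a JSON value by its keys, ignoring the values."""
    if isinstance(value, dict):
        return tuple(sorted((key, structure(child)) for key, child in value.items()))
    return None


def check_uniform(lines):
    """Return the 1-based numbers of lines whose structure differs from line 1."""
    expected = None
    bad = []
    for number, line in enumerate(lines, start=1):
        shape = structure(json.loads(line))
        if number == 1:
            expected = shape
        elif shape != expected:
            bad.append(number)
    return bad


sample = ['{"a": 1, "b": {"c": 2}}', '{"b": {"c": 5}, "a": 9}', '{"a": 1}']
print(check_uniform(sample))
```

Key order does not matter in this check, but a missing node such as the absent "b" in the third sample line is reported.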
All compressed data files must have a filename extension that matches the compression algorithm.
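The extension-to-algorithm mapping can be pictured as a simple lookup. The Python sketch below uses the extensions from the list above; the table and function names are assumptions for illustration, and only the three formats with a standard-library decompressor are wired to a real opener.

```python
import bz2
import gzip
import lzma

# Extensions from the list above. LZ4 and both Snappy variants have no
# decompressor in the Python standard library, so they map to None here.
OPENERS = {
    ".bz2": bz2.open,
    ".gz": gzip.open,
    ".xz": lzma.open,
    ".lz4": None,
    ".xsnappy": None,
    ".snappy": None,
}


def opener_for(filename: str):
    """Return (extension, opener); ('', open) means the file is uncompressed."""
    for extension, opener in OPENERS.items():
        if filename.endswith(extension):
            return extension, opener
    return "", open


print(opener_for("events.json.gz")[0])
```

A file whose extension does not match its real compression would be read with the wrong decompressor (or none at all), which is why the extension requirement matters.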
Column parsing is done by reading the first line of the smallest file in the logical table file path. If this first line differs structurally from the data in other lines and files, you will get inconsistent results or errors.
For non-compressed data files, the connector makes an HTTP range request on the data file in order to avoid downloading the entire file to get the first line. The maximum number of bytes per line is configurable via the manta.max_bytes_per_line parameter. The default value is 10240.
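For readers unfamiliar with range requests, this Python sketch shows the general technique of asking for only the first bytes of a file and cutting the result at the first newline. It is not the connector's implementation: the function names are hypothetical, and the authentication that real Manta requests require is omitted.

```python
import urllib.request

MAX_BYTES_PER_LINE = 10240  # mirrors the default of manta.max_bytes_per_line


def first_line_request(url: str, max_bytes: int = MAX_BYTES_PER_LINE):
    """Build a GET request for only the first max_bytes bytes of a file."""
    # A byte range is inclusive at both ends, so the last index is max_bytes - 1.
    byte_range = f"bytes=0-{max_bytes - 1}"
    return urllib.request.Request(url, headers={"Range": byte_range})


def first_line(chunk: bytes) -> str:
    """Return the first line of a downloaded chunk.

    This sketch treats a missing newline as an over-long line; a file holding
    a single unterminated line would need extra handling.
    """
    end = chunk.find(b"\n")
    if end == -1:
        raise ValueError("first line is longer than the requested byte range")
    return chunk[:end].decode("utf-8")


print(first_line(b'{"a": 1}\n{"a": 2}\n'))
```

If the first line of your data is longer than the configured limit, the requested range will not contain a complete line, so the limit should be raised to fit the widest expected row.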
All queries to Manta involve downloading multiple data files from Manta to the Presto worker(s). By design, this is a bandwidth-intensive operation. It is best to have your Presto workers and server located geographically near your Manta installation, with a high-bandwidth link between them. For example, in the case of the Triton Public Cloud, if you are using the Manta installation located in the US-EAST region, then running Presto in one of the US-EAST data centers / availability zones is ideal.
Since queries to Manta from Presto are done concurrently per remote file, you may want to increase the maximum connections setting to Manta above the default of 24. If you see errors related to timeouts when waiting for a connection from the Apache HTTP client's connection pool, this indicates that the manta.max_connections setting is too low.
To build the Presto Manta Connector you will need Maven 3.0+. Using Maven, execute:
mvn clean install
See our contribution guide for more information on contributing changes to the project.
See https://github.com/joyent/presto-manta/issues.
The Presto Manta Connector is licensed under the MPLv2. Please see the LICENSE file for more details.