Presto
Presto is a distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources. It is a tool designed to efficiently query vast amounts of data using distributed queries. If you work with terabytes or petabytes of data, you are likely using tools that interact with Hadoop and HDFS. Presto was designed as an alternative to tools that query HDFS using pipelines of MapReduce jobs such as Hive or Pig, but Presto is not limited to accessing HDFS. Presto can be and has been extended to operate over different kinds of data sources including traditional relational databases and other data sources such as Cassandra. Presto also allows you to perform a join on tables from different data sources and it is all done in memory.
Presto was designed to handle data warehousing and analytics: data analysis, aggregating large amounts of data and producing reports. These workloads are often classified as Online Analytical Processing (OLAP).
### Requirements

- Java 8
The latest Presto version requires Java 8. If Java 8 is not installed on all the machines, install it as follows:
```shell
node:~$ sudo add-apt-repository ppa:webupd8team/java -y
node:~$ sudo apt-get update
node:~$ sudo apt-get install oracle-java8-installer
```

### Install Presto
This installation process needs to be executed on every node you want in your Presto cluster.
```shell
node:~$ wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.115/presto-server-0.115.tar.gz -P ~/Downloads
node:~$ sudo tar zxvf ~/Downloads/presto-server-* -C /usr/local
node:~$ sudo mv /usr/local/presto-server-* /usr/local/presto
```

Set the PRESTO_HOME environment variable and add it to PATH in .profile:

```shell
node:~$ nano ~/.profile
```

Add the following Presto environment variables:

```shell
export PRESTO_HOME=/usr/local/presto
export PATH=$PATH:$PRESTO_HOME/bin
```

Then reload the profile:

```shell
node:~$ . ~/.profile
```
Presto needs a data directory for storing logs, local metadata, etc. It is recommended to create the data directory outside of the installation directory, so it can be easily preserved when upgrading Presto.
```shell
node:~$ sudo mkdir -p /var/presto/data
```

### Configure Presto
Create an etc directory inside the installation directory. This will hold the following configuration:
- Node Properties: environmental configuration specific to each node
- JVM Config: command line options for the Java Virtual Machine
- Config Properties: configuration for the Presto server
- Catalog Properties: configuration for Connectors (data sources)
The node properties file, etc/node.properties, contains configuration specific to each node. A node is a single installed instance of Presto on a machine. The following is a minimal etc/node.properties:
```shell
node:~$ mkdir $PRESTO_HOME/etc
node:~$ nano $PRESTO_HOME/etc/node.properties
```

Add the following to the node.properties file (note that `#` comments must be on their own lines; an inline `#` would become part of the value):

```
# All nodes in a cluster have the same environment name
node.environment=production
# Presto stores logs and other data here
node.data-dir=/var/presto/data
```

Also, every node has a unique id, so let's generate one and append it to node.properties:

```shell
node:~$ TEMPUID=$(uuidgen)
node:~$ echo node.id=$TEMPUID >> $PRESTO_HOME/etc/node.properties
```
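The node.properties steps can also be scripted. Below is a minimal sketch, where `ETC_DIR` is a stand-in for `$PRESTO_HOME/etc` (a temporary directory is used here so the sketch is safe to run anywhere):

```shell
#!/bin/sh
# Minimal sketch: generate a node.properties for this machine.
# ETC_DIR stands in for $PRESTO_HOME/etc (assumption for illustration).
ETC_DIR=$(mktemp -d)

# Fall back to the kernel's UUID source if uuidgen is not installed.
NODE_ID=$(uuidgen 2>/dev/null || cat /proc/sys/kernel/random/uuid)

cat > "$ETC_DIR/node.properties" <<EOF
# All nodes in a cluster share the same environment name
node.environment=production
# Presto stores logs and local metadata here
node.data-dir=/var/presto/data
# Unique per-node id; keep it stable across restarts and upgrades
node.id=$NODE_ID
EOF

cat "$ETC_DIR/node.properties"
```

On a real node you would set `ETC_DIR=$PRESTO_HOME/etc` and keep the generated node.id stable across upgrades.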
The JVM config file, etc/jvm.config, contains a list of command line options used for launching the Java Virtual Machine.
```shell
node:~$ nano $PRESTO_HOME/etc/jvm.config
```

Add the following to the jvm.config file:

```
-server
-Xmx16G
-XX:+UseG1GC
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:OnOutOfMemoryError=kill -9 %p
```
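The `-Xmx16G` value assumes machines with more than 16 GB of RAM. If your nodes differ, you can derive a heap size from installed memory; the sketch below uses 70% of RAM, which is an assumed rule of thumb for this guide, not an official Presto recommendation:

```shell
#!/bin/sh
# Sketch: derive an -Xmx flag from installed RAM (the 70% factor is an assumption).
MEM_KB=$(awk '/MemTotal/ {print $2}' /proc/meminfo 2>/dev/null)
[ -n "$MEM_KB" ] || MEM_KB=16777216    # fall back to 16 GB if /proc is unavailable

XMX_GB=$(( MEM_KB * 70 / 100 / 1024 / 1024 ))
if [ "$XMX_GB" -lt 1 ]; then XMX_GB=1; fi   # never go below 1 GB

XMX_FLAG="-Xmx${XMX_GB}G"
echo "$XMX_FLAG"
```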
The config properties file, etc/config.properties, contains the configuration for the Presto server. Every Presto server can function as both a coordinator and a worker, but dedicating a single machine to only perform coordination work provides the best performance on larger clusters.
For the node you choose to be the coordinator, configure it as follows
```shell
coordinator-node:~$ nano $PRESTO_HOME/etc/config.properties
```

Add the following to the config.properties file, replacing example.net with the coordinator's host name or IP:

```
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
task.max-memory=1GB
query.max-memory=50GB
query.max-memory-per-node=1GB
discovery-server.enabled=true
discovery.uri=http://example.net:8080
```
Configure all of the other worker nodes as follows:
```shell
worker-node:~$ nano $PRESTO_HOME/etc/config.properties
```

Add the following to the config.properties file, replacing example.net with the coordinator's host name or IP:

```
coordinator=false
http-server.http.port=8080
task.max-memory=1GB
query.max-memory=50GB
query.max-memory-per-node=1GB
discovery.uri=http://example.net:8080
```
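Since the coordinator and worker configurations share most keys, you could generate config.properties from a role flag. A minimal sketch, where `ROLE`, `COORD_HOST`, and `ETC_DIR` (a stand-in for `$PRESTO_HOME/etc`) are assumptions you would set per node:

```shell
#!/bin/sh
# Sketch: emit a config.properties for a given role.
# ROLE is "coordinator" or "worker"; COORD_HOST is the coordinator's address.
ROLE=${ROLE:-worker}
COORD_HOST=${COORD_HOST:-example.net}
ETC_DIR=$(mktemp -d)   # stand-in for $PRESTO_HOME/etc

{
  if [ "$ROLE" = "coordinator" ]; then
    echo "coordinator=true"
    echo "node-scheduler.include-coordinator=false"
    echo "discovery-server.enabled=true"
  else
    echo "coordinator=false"
  fi
  echo "http-server.http.port=8080"
  echo "task.max-memory=1GB"
  echo "query.max-memory=50GB"
  echo "query.max-memory-per-node=1GB"
  echo "discovery.uri=http://$COORD_HOST:8080"
} > "$ETC_DIR/config.properties"

cat "$ETC_DIR/config.properties"
```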
Next, create log.properties in etc as follows. This should be done on both coordinator and worker nodes.
```shell
node:~$ nano $PRESTO_HOME/etc/log.properties
```

Add the following to the log.properties file:

```
com.facebook.presto=DEBUG
```
Presto accesses data via connectors, which are mounted in catalogs. The connector provides all of the schemas and tables inside of the catalog. For example, the Hive connector maps each Hive database to a schema, so if the Hive connector is mounted as the hive catalog, and Hive contains a table clicks in database web, that table would be accessed in Presto as hive.web.clicks. Let’s mount Hive, Cassandra and jmx connectors as follows:
This should be done on both coordinator and worker nodes.
```shell
node:~$ mkdir $PRESTO_HOME/etc/catalog
node:~$ nano $PRESTO_HOME/etc/catalog/jmx.properties
```

Add the following to the jmx.properties file:

```
connector.name=jmx
```

```shell
node:~$ nano $PRESTO_HOME/etc/catalog/cassandra.properties
```

Add the following to the cassandra.properties file, using the public IP of at least one of the Cassandra nodes:

```
connector.name=cassandra
cassandra.contact-points=<public-ip>
```

```shell
node:~$ nano $PRESTO_HOME/etc/catalog/hive.properties
```

Add the following to the hive.properties file, using the private IP of the node where Hive is installed (it should be the same as Hadoop's namenode):

```
connector.name=hive-hadoop2
hive.metastore.uri=thrift://<private-ip>:9083
```

### Starting Presto
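Writing the three catalog files can also be scripted. A minimal sketch, where `CATALOG_DIR` stands in for `$PRESTO_HOME/etc/catalog` and the two host values are placeholder assumptions you must replace with your own addresses:

```shell
#!/bin/sh
# Sketch: write the three catalog property files; host values are placeholders.
CATALOG_DIR=$(mktemp -d)                       # stand-in for $PRESTO_HOME/etc/catalog
CASSANDRA_HOST=${CASSANDRA_HOST:-10.0.0.1}     # a Cassandra contact point (assumption)
METASTORE_HOST=${METASTORE_HOST:-10.0.0.2}     # the Hive metastore host (assumption)

printf 'connector.name=jmx\n' > "$CATALOG_DIR/jmx.properties"

printf 'connector.name=cassandra\ncassandra.contact-points=%s\n' "$CASSANDRA_HOST" \
  > "$CATALOG_DIR/cassandra.properties"

printf 'connector.name=hive-hadoop2\nhive.metastore.uri=thrift://%s:9083\n' "$METASTORE_HOST" \
  > "$CATALOG_DIR/hive.properties"

ls "$CATALOG_DIR"
```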
This should be done on all the nodes you have your Presto cluster on.
```shell
node:~$ sudo $PRESTO_HOME/bin/launcher run &
```

Alternatively, `sudo $PRESTO_HOME/bin/launcher start` launches Presto as a background daemon.
You can check the Web UI at http://coordinator-node-public-dns:8080.
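You can also probe the coordinator from the command line, assuming the server's standard `/v1/info` HTTP endpoint; `PRESTO_URL` here is a placeholder for your coordinator's address:

```shell
#!/bin/sh
# Sketch: probe the coordinator's HTTP port (PRESTO_URL is an assumption).
PRESTO_URL=${PRESTO_URL:-http://localhost:8080}

CODE=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$PRESTO_URL/v1/info" || true)
if [ "$CODE" = "200" ]; then
  MSG="coordinator reachable at $PRESTO_URL"
else
  MSG="coordinator not reachable at $PRESTO_URL (HTTP status: $CODE)"
fi
echo "$MSG"
```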
To stop Presto on any node, do the following:
```shell
node:~$ sudo $PRESTO_HOME/bin/launcher stop
```

### Presto Command Line Interface (CLI)
The CLI is typically installed on the coordinator node, but it can be installed on any machine (even one without Presto installed); you just have to point it at your coordinator node.
To install the CLI on the coordinator (or a worker) node:
```shell
node:~$ wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.115/presto-cli-0.115-executable.jar
node:~$ mv presto-cli-0.115-executable.jar $PRESTO_HOME/presto_cli
node:~$ chmod +x $PRESTO_HOME/presto_cli
```

Let's start the CLI. This connects to the Hive default database:

```shell
node:~$ $PRESTO_HOME/presto_cli --server localhost:8080 --catalog hive --schema default
```

You can also connect to a Cassandra keyspace as follows:

```shell
node:~$ $PRESTO_HOME/presto_cli --server localhost:8080 --catalog cassandra --schema keyspace-name
```
At this point, don't worry about how to use Presto; the Presto Example section below walks through querying with it.
If you are installing the CLI on a node other than your coordinator node, do as follows:
```shell
node:~$ wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.115/presto-cli-0.115-executable.jar
node:~$ mv presto-cli-0.115-executable.jar /usr/local/presto_cli
node:~$ chmod +x /usr/local/presto_cli
```

Let's start the CLI. This connects to the Hive default database:

```shell
node:~$ /usr/local/presto_cli --server <coordinator-node-private-ip>:8080 --catalog hive --schema default
```

You can also connect to a Cassandra keyspace as follows:

```shell
node:~$ /usr/local/presto_cli --server <coordinator-node-private-ip>:8080 --catalog cassandra --schema keyspace-name
```

### Checking Presto nodes
Execute the following query on the node you have CLI installed on to see all nodes in the Presto Cluster.
First start the CLI:

```shell
node:~$ /usr/local/presto_cli --server localhost:8080 --catalog hive --schema default
```

Then run the following query:

```
presto:default> SELECT * FROM system.runtime.nodes;
```
You should see the coordinator node and all the worker nodes.
Your Presto cluster is ready to be used. Refer to the Presto Example section below for how to use Presto to query data from Hive and Cassandra.
If you are curious, read this blog about Presto architecture.
### Presto Example

In this example we will query tables in Hive and Cassandra from Presto.
### Requirements

- Hive
- Cassandra

### Create a table in Cassandra

This can be done on any node in the Cassandra cluster.
```shell
node:~$ cqlsh
```

```
cqlsh> CREATE KEYSPACE presto WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 3};
cqlsh> USE presto;
cqlsh:presto> CREATE TABLE user_purchases (user_id INT, item TEXT, quantity INT, amount FLOAT, place TEXT, PRIMARY KEY (user_id));
cqlsh:presto> INSERT INTO user_purchases (user_id, item, quantity, amount, place) VALUES (1, 'Shirt', 2, 3050.50, 'New York');
cqlsh:presto> INSERT INTO user_purchases (user_id, item, quantity, amount, place) VALUES (2, 'Shoes', 3, 8140.60, 'Palo Alto');
cqlsh:presto> INSERT INTO user_purchases (user_id, item, quantity, amount, place) VALUES (3, 'Mobile Phone', 1, 18300.00, 'San Francisco');
cqlsh:presto> INSERT INTO user_purchases (user_id, item, quantity, amount, place) VALUES (4, 'Laptop', 1, 40140.60, 'New York');
```

Check the table contents:

```
cqlsh:presto> SELECT * FROM user_purchases;
```
### Create a table in Hive

We will create a CSV file and populate it with some data. This should be done on the Hadoop namenode.
```shell
node:~$ nano user_info.csv
```

Add the following lines to the CSV file:

```
1,Steven,Smith,24,42000,Male
2,Pawan,Lathwal,24,30000,Male
3,Mariya,Gilbert,25,44000,Female
4,Taylor,Lockwood,24,41000,Male
5,Sanjiv,Singh,25,51000,Male
```

```shell
node:~$ hdfs dfs -mkdir /data
node:~$ hdfs dfs -copyFromLocal user_info.csv /data
```
Start Hive on the namenode.
```shell
namenode:~$ hive
```

```
hive> CREATE EXTERNAL TABLE user_info (id INT, fname STRING, lname STRING, age INT, salary INT, gender STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> LOAD DATA INPATH '/data/user_info.csv' INTO TABLE user_info;
```

Check the table contents:

```
hive> SELECT * FROM user_info;
```
Note that the Hive metastore must be running, since Presto connects to it to read table metadata. Start the Hive metastore using the following command:
```shell
namenode:~$ hive --service metastore
```
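Before pointing Presto at the metastore, you can verify that its Thrift port is open. A small sketch, where `METASTORE_HOST` is an assumption you would set to the namenode's private IP:

```shell
#!/bin/sh
# Sketch: check whether the Hive metastore's Thrift port (9083) is reachable.
# METASTORE_HOST is an assumption; use the namenode's private IP.
METASTORE_HOST=${METASTORE_HOST:-localhost}

if nc -z -w 5 "$METASTORE_HOST" 9083 2>/dev/null; then
  MSG="metastore port 9083 is open on $METASTORE_HOST"
else
  MSG="metastore port 9083 is not reachable on $METASTORE_HOST"
fi
echo "$MSG"
```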
Now, let's start the Presto CLI on the node where you installed it.
```shell
node:~$ $PRESTO_HOME/presto_cli --server localhost:8080 --catalog cassandra --schema presto
```

You can check the Cassandra tables:

```
presto:presto> SHOW TABLES;
```

You can also check the table contents:

```
presto:presto> SELECT * FROM user_purchases;
```

You can also check the Hive table contents:

```
presto:presto> SELECT * FROM hive.default.user_info;
```
You can think of these two tables as living in the same database. Now, let's write a query that executes a join on them:
```
presto:presto> SELECT fname, lname, age, item, place FROM user_purchases JOIN hive.default.user_info ON hive.default.user_info.id = user_id;
```
Presto can perform joins on tables from multiple data sources and all the computation is done in memory, which makes it really powerful and useful for ad hoc queries.
Find out more about the Insight Data Engineering Fellows Program in New York and Silicon Valley, apply today, or sign up for program updates.
You can also read our engineering blog here.