Please install the following software:
- git version >= 2.37.1
- Docker version >= 20.10.17 and Docker compose v2 version >= v2.10.2. Make sure that docker is running using
docker ps
.
Windows users: please setup WSL and a local Ubuntu Virtual machine following the instructions here. Install the above prerequisites on your ubuntu terminal; if you have trouble installing docker, follow the steps here (only Step 1 is necessary). Please install the make command with sudo apt install make -y
(if its not already present).
All the commands shown below are to be run via the terminal (use the Ubuntu terminal for WSL users). We will use docker to set up our containers. Clone and move into the lab repository, as shown below.
git clone \
https://github.com/josephmachado/analytical_dp_with_sql.git
cd analytical_dp_with_sql
Note: If you are using Macbook M1, please follow the instructions here to use the appropriate docker image.
We have some helpful make commands to make working with our systems more accessible. Shown below are the make commands and their definitions
- make up: Spin up the docker containers.
- make trino: Open trino cli; Use exit to quit the cli. This is where you will type your SQL queries.
- make down: Stop the docker containers.
You can see the commands in this Makefile. If your terminal does not support make commands, please use the commands in the Makefile directly. All the commands in this book assume that you have the docker containers running.
In your terminal, do the following:
# Make sure docker is running using docker ps
make up # starts the docker containers
# If you are having issues with existing containers
# stop them all with the following command
# docker rm -f $(docker ps -a -q)
sleep 60 # wait 1 minute for all the containers to set up
make trino # opens the trino cli
In Trino, we can connect to multiple databases (called catalogs in Trino). TPC-H is a dataset used to benchmark analytical database performance. Trino's tpch catalog comes with preloaded tpch datasets of different sizes tiny, sf1, sf100, sf100, sf300, and so on, where sf = scaling factor.
-- run "make trino" or
-- "docker container exec -it trino-coordinator trino"
-- to open trino cli
USE tpch.tiny;
SHOW tables;
SELECT * FROM orders LIMIT 5;
-- shows five rows, press q to quit the interactive results screen
exit -- quit the cli
Note: Run make trino
or docker container exec -it trino-coordinator trino
on your terminal to open the trino cli. The SQL code shown throughout the book assumes you are running it in trino cli.
Starting the docker containers will also start Minio(S3 alternative); we will use Minio as our data store to explain efficient data storage.
UI: Open the Trino UI at http://localhost:8080 (username: any word) and Minio (S3 alternative) at http://localhost:9001 (username: minio, password: minio123) in a browser of your choice.
If you prefer to connect to Trino via a SQL IDE, download DBeaver (addresses issues mentioned here). Open DBeaver
,
- Click on Database -> New Database Connection
- A
Connect to a database
box will open; search for, selectTrino
, and press Next. - Do not change settings; use user as the user name. You will get a
connected
text box if you test the connection. Click Finish, and you will be able to explore our Trino database.
The TPC-H data represents a car parts seller's data warehouse, where we record orders, items that make up that order (lineitem), supplier, customer, part(parts sold), region, nation, and partsupp.
Note: Have a copy of the data model as you follow along; this will help with the examples provided and answering exercise questions.