Creates a fully configured, ready-to-use Druid cluster that uses S3 as deep storage and MySQL as the metadata store.
You can start everything with this command:
docker-compose up
All ${} environment variables in the docker-compose file are substituted with the values defined in the .env file.
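For reference, a minimal .env might look like the sketch below. The variable names and values here are assumptions based on the settings described later in this document, so check the .env file shipped with the repository for the authoritative list.
# hypothetical sketch of .env - verify against the real file
DRUID_VERSION=0.20.0
S3_ACCESS_KEY=druid-s3-access
S3_SECRET_KEY=druid-s3-secret
METADATA_STORAGE_USER=druid
METADATA_STORAGE_PASSWORD=diurd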
By default it ships with Druid 0.20.0. If you want to change the version, use this command:
docker-compose build --build-arg DRUID_VERSION="your-version"
This starts a Druid cluster including:
- Zookeeper
  - zookeeper - ZooKeeper itself. Runs on port 2181.
  - zoonavigator-web - Web interface for managing configuration stored within ZooKeeper. Runs on port 8000.
  - zoonavigator-api - Just a backend that serves zoonavigator-web requests.
- S3
  - s3-minio - An open source service that mimics the S3 API and behavior. Runs on port 9000.
- Mysql
  - mysql - Official MySQL image, version 5.7.26. Runs on port 3306.
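If you want to double-check which containers came up, the standard docker-compose commands work as usual; the service name in the logs command below is an assumption that it matches the compose file:
docker-compose ps
docker-compose logs -f coordinator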
NOTE
Druid is very sensitive to memory settings. Despite my attempts to reduce buffer sizes and other settings related to memory consumption, it requires at least 4 GB of memory to run. Don't forget to adjust Docker's memory settings to fit the requirements.
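As a quick sanity check, you can ask the Docker daemon how much memory it is allowed to use before starting the cluster:
docker info | grep -i "total memory"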
The most useful services:
- Druid Management UI - http://localhost:8080
- S3 Minio UI - http://localhost:9000 (access key - druid-s3-access, secret key - druid-s3-secret)
- Zookeeper Web UI - http://localhost:8000 (connection string - zookeeper:2181)
- Mysql connection - jdbc:mysql://localhost:3306/druid (user name - druid, password - diurd, root password - 1234)
- coordinator runs on port 8081
The coordinator is primarily responsible for segment management and distribution. More specifically, the Druid Coordinator process communicates to Historical processes to load or drop segments based on configurations. The Druid Coordinator is responsible for loading new segments, dropping outdated segments, managing segment replication, and balancing segment load.
- broker runs on port 8082
The Broker is the process to route queries to if you want to run a distributed cluster. It understands the metadata published to ZooKeeper about what segments exist on what processes and routes queries such that they hit the right processes. This process also merges the result sets from all of the individual processes together. On start up, Historical processes announce themselves and the segments they are serving in Zookeeper.
- historical runs on port 8083
Each Historical process maintains a constant connection to Zookeeper and watches a configurable set of Zookeeper paths for new segment information. Historical processes do not communicate directly with each other or with the Coordinator processes but instead rely on Zookeeper for coordination.
- overlord runs on port 8090
The Overlord process is responsible for accepting tasks, coordinating task distribution, creating locks around tasks, and returning statuses to callers. Overlord can be configured to run in one of two modes - local or remote. In local mode Overlord is also responsible for creating Peons for executing tasks. When running the Overlord in local mode, all MiddleManager and Peon configurations must be provided as well. Our cluster is configured to run in remote mode.
- middlemanager runs on port 8091
The MiddleManager process is a worker process that executes submitted tasks. Middle Managers forward tasks to Peons that run in separate JVMs.
- router runs on port 8080
In addition to query routing, the Router also runs the Druid Console, a management UI for datasources, segments, tasks, data processes (Historicals and MiddleManagers), and coordinator dynamic configuration. The user can also run SQL and native Druid queries within the console.
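Once everything is up, a quick way to verify the core services is to hit the standard Druid /status endpoint on each process and to open a connection to the metadata store. The sketch below assumes the ports and credentials listed above and requires a local mysql client:
curl http://localhost:8081/status    # coordinator
curl http://localhost:8090/status    # overlord
curl http://localhost:8080/status    # router
mysql -h 127.0.0.1 -P 3306 -u druid -pdiurd druid -e "SHOW TABLES;"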
Go to the router page and you should see the following UI.
Go to the s3-minio page and you should see the following UI (secrets can be found in the running section).
Go to the zookeeper web page and you should see the following UI (secrets can be found in the running section).
Go to s3-minio and create a new bucket called testbucket, then upload wikiticker-2015-09-12-sampled.json.
The file can be found inside the druid-docker/data folder.
After uploading the file, go to the Router UI, then click Load data -> Amazon S3 -> Connect data.
Enter s3://testbucket/wikiticker-2015-09-12-sampled.json.gz into S3 URIs, click Apply, then click Parse data.
You will see the raw content of your json file.
Now we will explain to Druid how to parse our json file. Click the Edit spec button and insert the spec defined in wikipedia-index.json.
Press the Submit button and voila, Druid has started running the task.
After the ingestion task has finished, the data will be loaded by Historical processes and will be available for querying within a minute or two. The segment files can be found in the S3 storage as well.
Alternatively, you can start the ingestion with the following curl command.
It submits the task by POSTing it to Druid's Overlord (which by default runs on port 8090):
curl -X 'POST' -H 'Content-Type:application/json' -d @your-repository-folder/wikipedia-index.json http://localhost:8090/druid/indexer/v1/task
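The submission returns a JSON payload with the task id, which you can use to poll the Overlord for progress. These are Druid's standard task APIs; replace <task-id> with whatever the submission returned:
curl http://localhost:8090/druid/indexer/v1/tasks
curl http://localhost:8090/druid/indexer/v1/task/<task-id>/status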
Now open the SQL tab: http://localhost:8080/unified-console.html#query
You are free to run any SQL query, for example:
SELECT page, COUNT(*) AS Edits
FROM wikipedia
WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00'
GROUP BY page
ORDER BY Edits DESC LIMIT 10
Or any other query, as demonstrated in this tutorial.
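If you prefer to query outside the console, Druid also exposes its SQL API over HTTP at /druid/v2/sql. A sketch against the router port used in this setup:
curl -X POST -H 'Content-Type:application/json' \
  -d '{"query":"SELECT page, COUNT(*) AS Edits FROM wikipedia GROUP BY page ORDER BY Edits DESC LIMIT 10"}' \
  http://localhost:8080/druid/v2/sql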
- ZOOKEEPER_SERVER - The zookeeper server address.
- DRUID_SERVICE - The name of the druid service [broker, historical, coordinator, overlord, middleManager, router].
- DRUID_HOST - The address Druid advertises to expose the service in ZooKeeper.
- DRUID_SERVICE_PORT - The port the Druid service binds to.
- DRUID_JVM_ARGS - The JVM arguments used to run the Druid services.
- DRUID_PULL_EXTENSION - The Druid extensions to download; the download is performed at runtime. Separate multiple extensions with whitespace:
extension1:extension1:version extension2:extension2:version
- AWS_REGION - The AWS region; this is needed to work with the AWS S3 extension.
- S3_ACCESS_KEY - Your S3 access key
- S3_SECRET_KEY - Your S3 secret key
- S3_STORAGE_BUKCET - The S3 bucket segments are stored in; it must already exist.
- S3_STORAGE_BASE_KEY - The S3 key prefix (path) under which segments are stored.
- METADATA_STORAGE_HOST - The name of the metadata storage [mysql, postgresql, derby].
- METADATA_STORAGE_PORT - The port the metadata storage is listening on.
- METADATA_STORAGE_USER - The user name to connect to the metadata storage
- METADATA_STORAGE_PASSWORD - The password to connect to the metadata storage
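To make the list above more concrete, a common_env_config for this setup could look roughly like the following. The values are illustrative only (partly taken from defaults mentioned elsewhere in this document, partly placeholders), not a copy of the shipped file:
ZOOKEEPER_SERVER=zookeeper:2181
AWS_REGION=us-east-1
S3_STORAGE_BUKCET=druid-deep-storage
S3_STORAGE_BASE_KEY=druid/segments
METADATA_STORAGE_HOST=mysql
METADATA_STORAGE_PORT=3306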
- Configuring hadoop-client
In order to enable Hadoop indexing, hadoop-client should be properly configured.
To configure the client, the Hadoop configuration files (e.g. core-site.xml, yarn-site.xml, etc.) should exist on Druid's classpath.
Just place them under the ~/druid/conf/_common/hadoop_xml/ folder and the druid-start.sh script will add them to Druid's classpath.
- Changing Hadoop version
Changing the Hadoop version is as simple as setting the HADOOP_VERSION variable.
By default it is taken from the common_env_config file, e.g.
HADOOP_VERSION=2.9.2
- Loading additional extensions
Any additional extensions can be loaded by modifying the DRUID_HADOOP_EXTENIONS variable.
E.g. adding aws-s3 support:
DRUID_HADOOP_EXTENIONS=org.apache.hadoop:hadoop-aws
- Configuring the indexing service, which is part of the middlemanager
Change the druid.indexer.task.defaultHadoopCoordinates property in the ~/middleManager/runtime.properties_env file.
E.g.
druid.indexer.task.defaultHadoopCoordinates=["org.apache.hadoop:hadoop-client:$HADOOP_VERSION","org.apache.hadoop:hadoop-aws:$HADOOP_VERSION"]
- Using Amazon's S3 as deep storage instead of minio
Comment out the following properties in the ~/druid/_common/common.runtime.properties_env file:
druid.s3.enablePathStyleAccess=true
druid.s3.endpoint.url=http://s3-minio:9000
druid.s3.endpoint.signingRegion=us-east-1
And don't forget to modify these variables:
AWS_REGION
S3_STORAGE_BUKCET
S3_STORAGE_BASE_KEY
S3_ACCESS_KEY
S3_SECRET_KEY
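As a sketch, after commenting those properties out the S3-related variables might end up looking something like this; the region, bucket, and credentials below are placeholders, not values from this repository:
AWS_REGION=eu-west-1
S3_STORAGE_BUKCET=my-druid-deep-storage
S3_STORAGE_BASE_KEY=druid/segments
S3_ACCESS_KEY=<your-aws-access-key>
S3_SECRET_KEY=<your-aws-secret-key>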
- build an image
docker build -t druid-mysql-image .
- run as a service
docker run -dit --name mysql -p 3306:3306 --network druid-net --env-file ../.env druid-mysql-image
- run as a service
docker run -dit --name s3-minio -p 9000:9000 --network druid-net -e MINIO_ACCESS_KEY=druid-s3-access -e MINIO_SECRET_KEY=druid-s3-secret --env-file .env minio/minio:latest server /data
- run as a service
docker run -dit --name zookeeper -p 2181:2181 --network druid-net -e ZOO_MY_ID=1 zookeeper:latest
- build an image
docker build -t druid-base-image .
- run as a service
docker run -dit --name coordinator -p 8080:8081 --network druid-net \
-e DRUID_SERVICE=coordinator \
-e "DRUID_JVM_ARGS=-server -Xms256m -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager" \
-e S3_ACCESS_KEY=druid-s3-access \
-e S3_SECRET_KEY=druid-s3-secret \
-e METADATA_STORAGE_USER=druid \
-e METADATA_STORAGE_PASSWORD=diurd \
--env-file common_env_config druid-base-image
- run as a service
docker run -dit --name broker -p 8383:8082 --network druid-net \
-e DRUID_SERVICE=broker \
-e "DRUID_JVM_ARGS=-server -Xms3g -Xmx3g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager -XX:NewSize=512m -XX:MaxNewSize=512m -XX:MaxDirectMemorySize=3g -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Dcom.sun.management.jmxremote.port=17071 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false" \
-e S3_ACCESS_KEY=druid-s3-access \
-e S3_SECRET_KEY=druid-s3-secret \
-e METADATA_STORAGE_USER=druid \
-e METADATA_STORAGE_PASSWORD=diurd \
--env-file common_env_config druid-base-image
- run as a service
docker run -dit --name historical -p 8083:8083 --network druid-net \
-e DRUID_SERVICE=historical \
-e "DRUID_JVM_ARGS=-server -Xms2g -Xmx2g -XX:MaxDirectMemorySize=3g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager -XX:NewSize=1g -XX:MaxNewSize=1g -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
-e S3_ACCESS_KEY=druid-s3-access \
-e S3_SECRET_KEY=druid-s3-secret \
-e METADATA_STORAGE_USER=druid \
-e METADATA_STORAGE_PASSWORD=diurd \
--env-file common_env_config druid-base-image
- run as a service
docker run -dit --name overlord -p 8082:8090 --network druid-net \
-e DRUID_SERVICE=overlord \
-e "DRUID_JVM_ARGS=-server -Xms256m -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager" \
-e S3_ACCESS_KEY=druid-s3-access \
-e S3_SECRET_KEY=druid-s3-secret \
-e METADATA_STORAGE_USER=druid \
-e METADATA_STORAGE_PASSWORD=diurd \
--env-file common_env_config druid-base-image
- run as a service
docker run -dit --name middlemanager -p 8091:8091 --network druid-net \
-e DRUID_SERVICE=middleManager \
-e "DRUID_JVM_ARGS=-server -Xms256m -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager" \
-e S3_ACCESS_KEY=druid-s3-access \
-e S3_SECRET_KEY=druid-s3-secret \
-e METADATA_STORAGE_USER=druid \
-e METADATA_STORAGE_PASSWORD=diurd \
--env-file common_env_config druid-base-image
- run as a service
docker run -dit --name router -p 8081:8080 --network druid-net \
-e DRUID_SERVICE=router \
-e "DRUID_JVM_ARGS=-server -Xms256m -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager" \
-e S3_ACCESS_KEY=druid-s3-access \
-e S3_SECRET_KEY=druid-s3-secret \
-e METADATA_STORAGE_USER=druid \
-e METADATA_STORAGE_PASSWORD=diurd \
--env-file common_env_config druid-base-image
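Note that all of the docker run commands above attach to a user-defined druid-net network. If it does not exist yet, create it once before starting any of the containers:
docker network create druid-net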