Compass is a big data task diagnosis platform that aims to make troubleshooting more efficient for users and to reduce the cost of abnormal tasks.
The key features:
- Non-invasive, instant diagnosis: you can try out the diagnostics without modifying your existing scheduling platform.
- Supports multiple scheduling platforms (DolphinScheduler, Airflow, self-developed platforms, etc.).
- Supports troubleshooting for Spark 2.x and 3.x and for Hadoop 2.x and 3.x.
- Supports workflow-layer exception diagnosis, identifying various failures and baseline time-consuming abnormalities.
- Supports Spark engine-layer exception diagnosis, covering 14 types of exceptions such as data skew, large table scanning, and memory waste.
- Supports custom log-matching rules and adjustable abnormality thresholds, so diagnosis can be tuned to actual scenarios.
Compass currently supports the following diagnostic types:
| Diagnostic Dimensions | Diagnostic Type | Type Description |
| --- | --- | --- |
| Failure analysis | Run failure | Tasks that ultimately fail to run |
| Failure analysis | First failure | Tasks that have been retried more than once |
| Failure analysis | Long term failure | Tasks that have failed to run in the last ten days |
| Time analysis | Baseline time abnormality | Tasks that end earlier or later than the historical normal end time |
| Time analysis | Baseline time-consuming abnormality | Tasks that run for too long or too short relative to the historical normal running time |
| Time analysis | Long running time | Tasks that run for more than two hours |
| Error analysis | SQL failure | Tasks that fail due to SQL execution issues |
| Error analysis | Shuffle failure | Tasks that fail due to shuffle execution issues |
| Error analysis | Memory overflow | Tasks that fail due to memory overflow issues |
| Cost analysis | Memory waste | Tasks with a peak memory usage to total memory ratio that is too low |
| Cost analysis | CPU waste | Tasks with a driver/executor calculation time to total CPU calculation time ratio that is too low |
| Efficiency analysis | Large table scanning | Tasks with too many scanned rows due to no partition restrictions |
| Efficiency analysis | OOM warning | Tasks with a cumulative memory of broadcast tables and a high memory ratio of driver or executor |
| Efficiency analysis | Data skew | Tasks where the maximum amount of data processed by the task in the stage is much larger than the median |
| Efficiency analysis | Job time-consuming abnormality | Tasks with a high ratio of idle time to job running time |
| Efficiency analysis | Stage time-consuming abnormality | Tasks with a high ratio of idle time to stage running time |
| Efficiency analysis | Task long tail | Tasks where the maximum running time of the task in the stage is much larger than the median |
| Efficiency analysis | HDFS stuck | Tasks where the processing rate of tasks in the stage is too slow |
| Efficiency analysis | Too many speculative execution tasks | Tasks in which speculative execution of tasks frequently occurs in the stage |
| Efficiency analysis | Global sorting abnormality | Tasks with long running time due to global sorting |
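The "Data skew" and "Task long tail" rows both compare the largest per-task value in a stage with the median. The sketch below illustrates that kind of check in shell. It is not Compass's implementation: the function names `skew_ratio` and `is_skewed` and the threshold passed by the caller are assumptions made for illustration only.

```shell
#!/bin/sh
# Hypothetical sketch of a max-vs-median check; not taken from Compass.

# skew_ratio: print max/median (two decimals) for the per-task values
# given as arguments, e.g. bytes read or running time per task.
skew_ratio() {
  [ "$#" -gt 0 ] || return 1
  printf '%s\n' "$@" | sort -n | awk '
    { v[NR] = $1 }
    END {
      # Median of the sorted values (mean of the two middle values if even).
      if (NR % 2) { median = v[(NR + 1) / 2] }
      else        { median = (v[NR / 2] + v[NR / 2 + 1]) / 2 }
      if (median + 0 == 0) { exit 1 }   # ratio is undefined for a zero median
      printf "%.2f\n", v[NR] / median   # v[NR] is the maximum after sorting
    }'
}

# is_skewed THRESHOLD VALUE...: succeed when max/median exceeds THRESHOLD.
# The threshold is chosen by the caller; no Compass default is implied.
is_skewed() {
  threshold="$1"; shift
  ratio="$(skew_ratio "$@")" || return 2
  awk -v r="$ratio" -v t="$threshold" 'BEGIN { exit !(r > t) }'
}
```

For example, `is_skewed 5 10 10 10 100` succeeds because the largest task processed ten times the median, which is above the assumed threshold of 5.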
git clone https://github.com/cubefs/compass.git
cd compass
mvn package -DskipTests
cd dist/compass
vim bin/compass_env.sh
# Scheduler MySQL
export SCHEDULER_MYSQL_ADDRESS="ip:port"
export SCHEDULER_MYSQL_DB="scheduler"
export SCHEDULER_DATASOURCE_USERNAME="user"
export SCHEDULER_DATASOURCE_PASSWORD="pwd"
# Compass MySQL
export COMPASS_MYSQL_ADDRESS="ip:port"
export COMPASS_MYSQL_DB="compass"
export SPRING_DATASOURCE_USERNAME="user"
export SPRING_DATASOURCE_PASSWORD="pwd"
# Kafka
export SPRING_KAFKA_BOOTSTRAPSERVERS="ip1:port,ip2:port"
# Redis
export SPRING_REDIS_CLUSTER_NODES="ip1:port,ip2:port"
# Zookeeper
export SPRING_ZOOKEEPER_NODES="ip1:port,ip2:port"
# Elasticsearch
export SPRING_ELASTICSEARCH_NODES="ip1:port,ip2:port"
./bin/start_all.sh
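Because `bin/compass_env.sh` ships with placeholder values such as `ip:port`, it can help to confirm that every variable has been filled in before starting the services. The following is a minimal sketch of such a preflight check. The function `check_compass_env` is hypothetical and not part of the Compass distribution; the variable names are copied from the configuration shown above, and the sketch assumes the file can be sourced into the current shell.

```shell
#!/bin/sh
# Hypothetical preflight check; not part of the Compass distribution.

# Variable names as they appear in bin/compass_env.sh.
REQUIRED_VARS="
SCHEDULER_MYSQL_ADDRESS SCHEDULER_MYSQL_DB
SCHEDULER_DATASOURCE_USERNAME SCHEDULER_DATASOURCE_PASSWORD
COMPASS_MYSQL_ADDRESS COMPASS_MYSQL_DB
SPRING_DATASOURCE_USERNAME SPRING_DATASOURCE_PASSWORD
SPRING_KAFKA_BOOTSTRAPSERVERS SPRING_REDIS_CLUSTER_NODES
SPRING_ZOOKEEPER_NODES SPRING_ELASTICSEARCH_NODES
"

# Return 0 only if every required variable is set, non-empty,
# and no longer contains the literal ":port" placeholder.
check_compass_env() {
  status=0
  for name in $REQUIRED_VARS; do
    eval "value=\${$name:-}"   # indirect lookup of the variable named in $name
    case "$value" in
      "")      echo "missing: $name" >&2; status=1 ;;
      *:port*) echo "placeholder: $name=$value" >&2; status=1 ;;
    esac
  done
  return "$status"
}
```

One possible usage, under the same assumptions: `. bin/compass_env.sh && check_compass_env && ./bin/start_all.sh`.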
Compass is licensed under the Apache License, Version 2.0. For details, see LICENSE and NOTICE.