This repository contains code for gathering and uploading KBase system metrics from the Condor API, such as memory, disk, and CPU usage, as well as job and queue information. The code is run every hour through a cron job.
The main function of this repo is the 'get_system_report' function, located in the 'get_system_reports' file. This function calls 'get_report_machines' and 'get_report_jobs', because System Metrics is composed of information about Condor jobs (running and idle) and information about KBase machines. The 'get_report_jobs' function calls the job function 'get_job_info' from the file 'calculate_queue_resources'. Within the queue resources file, 'get_job_info' is the main function.
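The call structure described above can be pictured roughly as follows. This is only a sketch: the function names come from this README, but the signatures, return values, and dictionary keys are assumptions made for illustration, and the real functions query Condor rather than returning placeholders.

```python
# Hypothetical sketch of the call structure; all bodies are placeholders.

def get_job_info():
    """Stand-in for calculate_queue_resources.get_job_info (assumed return shape)."""
    return {"running": 0, "idle": 0}


def get_report_jobs():
    """Collect Condor job information (running and idle jobs)."""
    return get_job_info()


def get_report_machines():
    """Collect KBase machine information (assumed return shape)."""
    return {"machines": []}


def get_system_report():
    """Combine the job report and the machine report into one structure."""
    return {"jobs": get_report_jobs(), "machines": get_report_machines()}
```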
Before you can run this Docker container, you need to create a file called .env that contains the following:
- USER_TOKEN=
- SERVICE_WIZARD_URL=
- ELASTICSEARCH_HOST=
- CONDOR_JOB_URL=
- CONDOR_MACHINE_URL=
Please ask a fellow developer for the correct URLs and update your .env file accordingly.
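A quick check like the one below can catch an incomplete .env before a run. It is a minimal sketch, not part of the repo: it assumes the .env values reach the container as environment variables, and the helper name 'missing_env_vars' is made up for this example.

```python
import os

# Variable names are taken from the .env list above.
REQUIRED_VARS = [
    "USER_TOKEN",
    "SERVICE_WIZARD_URL",
    "ELASTICSEARCH_HOST",
    "CONDOR_JOB_URL",
    "CONDOR_MACHINE_URL",
]


def missing_env_vars(environ=None):
    """Return the required variable names that are unset or empty.

    'environ' defaults to os.environ; pass a dict to check other settings.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling missing_env_vars() with no arguments inspects the current environment; an empty list means every setting is present.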
The script in hooks/build builds a Docker image named "kbase/systemmetrics" from the current contents of the repo. Run it with:
$ IMAGE_NAME=kbase/systemmetrics hooks/build
Once it's built, you can run the source directory with the following command:
$ docker-compose run --rm SystemMetrics
Or you can run the cron job with:
$ docker-compose run --rm SystemMetrics ../bin/cron_shell.sh
To test the output of the cron job or main script (get_errored_apps_EE2) through Logstash, you must set up a 'Logstash Listener/Debugger'. First fork the Logstash repo (https://github.com/kbase/logstash) and pull it into a separate directory on docker03. Then create a docker-compose.yml and an env file containing the following:
IMAGE_NAME=
DOCKER_REPO=
then run:
docker run --rm -it -e debug_output=True -p 9000:9000 -p 5044:5044 kbase/logstash
If port 9000 is taken, run the following Docker command to find the name of the container using port 9000:
docker ps | grep 9000
then
docker kill CONTAINER_NAME
Once the Logstash Listener/Debugger is up and running, change the ELASTICSEARCH_HOST URL in the .env file of your System Metrics environment to ...* (ask Steve Chan for the Logstash debugger URL to use). Then run the System Metrics cron job described above and view its output in the Logstash Debugger.
To test any System Metrics code without sending logs to Logstash, comment out the following line in get_system_reports:
c.to_logstashJson(queue_dict)
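As an alternative to editing the file each time, the call could be guarded by a setting. The sketch below is only an illustration: the SEND_TO_LOGSTASH variable and both helper functions are hypothetical and do not exist in the repo, and 'client' stands for whatever object 'c' is in get_system_reports.

```python
import os


def logstash_enabled(environ=None):
    """Return False when SEND_TO_LOGSTASH (hypothetical) is 0, false, or no."""
    env = os.environ if environ is None else environ
    value = env.get("SEND_TO_LOGSTASH", "true")
    return value.strip().lower() not in ("0", "false", "no")


def send_report(client, queue_dict, environ=None):
    """Send queue_dict via client.to_logstashJson unless sending is disabled.

    Returns True if the report was sent, False if it was only printed.
    """
    if logstash_enabled(environ):
        client.to_logstashJson(queue_dict)
        return True
    print(queue_dict)  # Local testing: show the payload instead of sending it.
    return False
```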
Run the System Metrics container and make sure it has a text editor such as nano or zile. For example, to install zile:
apt-get update
apt-get install zile
Once your debugging environment is set up in the Docker container, go ahead and edit, test, and run in Python.
Currently the job statuses are:
- created
- estimating
- queued
- running
- finished # Successful run (legacy code)
- completed # Successful run in EE2
- error # Failed run: something went wrong and the job failed. Possible reasons are (ErrorCodes)
- terminated # Canceled by user, admin, or script. Possible reasons are (TerminatedCodes)
The Condor JobStatus attribute is an integer which indicates the current status of the job:
- 1: Idle
- 2: Running
- 3: Removing
- 4: Completed
- 5: Held
- 6: Transferring Output
- 7: Suspended
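When reading job records in Python, it can help to translate the integer into its name. The lookup below is a small sketch; the function name 'condor_status_name' is made up for this example and is not part of the repo.

```python
# HTCondor JobStatus codes and their names.
CONDOR_JOB_STATUS = {
    1: "Idle",
    2: "Running",
    3: "Removing",
    4: "Completed",
    5: "Held",
    6: "Transferring Output",
    7: "Suspended",
}


def condor_status_name(code):
    """Translate a JobStatus value (int or numeric string) into its name.

    Codes that are not in the table are reported as 'Unknown'.
    """
    return CONDOR_JOB_STATUS.get(int(code), "Unknown")
```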
The job statuses and the Condor statuses do not map one to one, and each tells you different information.
If a job is IDLE, it can still run in Condor.
If a job is RUNNING, it is currently running in Condor.
If a job is HELD, it will probably never run again, depending on the HOLD REASON (see the HTCondor Manual).
If the hold reason code is 16 (Input files are being spooled), the job is about to enter the idle state; otherwise, it will never run again.
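These rules can be written down as a small helper. This is a sketch for illustration only: the function name and the returned phrases are invented here, and statuses other than idle, running, and held are deliberately left unclassified because the rules do not cover them.

```python
# Condor JobStatus codes used by the rules above.
IDLE, RUNNING, HELD = 1, 2, 5
# Hold reason code 16: "Input files are being spooled".
HOLD_REASON_SPOOLING_INPUT = 16


def describe_condor_job(job_status, hold_reason_code=None):
    """Summarize whether a job can still run, following the rules above."""
    if job_status == IDLE:
        return "can still run"
    if job_status == RUNNING:
        return "currently running"
    if job_status == HELD:
        if hold_reason_code == HOLD_REASON_SPOOLING_INPUT:
            return "about to become idle"
        return "will probably never run again"
    return "not covered by these rules"
```

For example, describe_condor_job(5, 16) reports a held job that is still spooling its input, while describe_condor_job(5, 1) reports one that is unlikely to run again.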