Release Operations Controller or "roller"

This is a service for managing Firefox release operations (RelOps) hardware. It is a rewrite of build-slaveapi, based on tecken, to help migrate from Buildbot to Taskcluster.

Architecture

The service consists of a Django REST Framework web API, a Redis-backed Celery queue, and one or more Celery workers. It should be run behind a VPN.

                 +-----------------------------------------------------------------------------+
                 | VPN                                                                         |
                 |                                                                             |
+------------+   |   +--------------+     +----------------+     +-----------+     +--------+  |
|            |   |   |    Roller    |     |    Roller      |     |  Roller   +----->        |  |
|  TC Dash.  +------->    API       +----->    Queue       +----->  Workers  |     |  HW 1  |  |
|            |   |   |              |     |                |     |           <-----+        |  |
|            <-------+              <-----+                <-----+           |     +--------+  |
|            |   |   |              |     |                |     |           |                 |
+------------+   |   +----+---+-----+     +----------------+     |           |     +--------+  |
                 |                                               |           +----->        |  |
                 |                                               |           |     |  HW 2  |  |
                 |                                               |           <-----+        |  |
                 |                                               |           |     +--------+  |
                 |                                               |           |                 |
                 |                                               |           |     +--------+  |
                 |                                               |           +----->        |  |
                 |                                               |           |     |  HW 3  |  |
                 |                                               |           <-----+        |  |
                 |                                               +-----------+     +--------+  |
                 |                                                                             |
                 |                                                                             |
                 +-----------------------------------------------------------------------------+

Data Flow

After a Roller admin registers an action with Taskcluster, a sheriff or RelOps operator can use the actions dropdown on a worker page of the Taskcluster dashboard to trigger an action (ping, reboot, reimage, etc.) on a RelOps-managed machine.

Under the hood, the Taskcluster dashboard makes a CORS request to the Roller API, which checks the Taskcluster Authorization header and scopes, then queues a Celery task for a Roller worker to run. (There is an open issue for sending notifications back to the user.)
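In outline, the API side behaves like the sketch below. This is illustrative Python, not the repository's actual view code; check_taskcluster_scopes is a hypothetical helper standing in for the Hawk header and scope verification, and the task argument shape is a guess:

from celery import current_app
from django.conf import settings
from django.http import HttpResponseBadRequest, JsonResponse

def create_job(request, worker_id, worker_group):
    # Only tasks whitelisted in settings.TASK_NAMES may be queued.
    task_name = request.GET.get('task_name')
    if task_name not in settings.TASK_NAMES:
        return HttpResponseBadRequest('unknown task_name')

    # Verify the Hawk Authorization header and Taskcluster scopes.
    if not check_taskcluster_scopes(request):  # hypothetical helper
        return HttpResponseBadRequest('bad credentials or scopes')

    # Queue the Celery task for a Roller worker to pick up.
    result = current_app.send_task(task_name, args=[worker_id, worker_group])
    return JsonResponse({
        'task_name': task_name,
        'worker_id': worker_id,
        'worker_group': worker_group,
        'task_id': result.id,
    })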

(Figure: data flow sequence diagram)

API

POST /api/v1/workers/$worker_id/group/$worker_group/jobs?task_name=$task_name

This is the URL format for worker-context Taskcluster actions; the actions need to be registered with Taskcluster (see Registering Actions with Taskcluster below).

URL params:

  • $worker_id the Taskcluster Worker ID, e.g. ms1-10. 1 to 128 characters long.

  • $worker_group the Taskcluster Worker Group, e.g. mdc1, usually a datacenter for RelOps hardware. 1 to 128 characters long.

Query param:

  • $task_name the Celery task to run. Must be listed in TASK_NAMES in settings.py.

Taskcluster does not send POST data/body params.

Example request from Taskcluster:

POST http://localhost:8000/api/v1/workers/dummy-worker-id/group/dummy-worker-group/jobs?task_name=ping
Authorization: Hawk ...

Example response:

{"task_name":"ping","worker_id":"dummy-worker-id","worker_group":"dummy-worker-group","task_id":"e62c4d06-8101-4074-b3c2-c639005a4430"}

Where task_name, worker_id, and worker_group are as defined in the request and task_id is the task's Celery AsyncResult UUID.
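
Since notifications are not sent back to the user yet, one way to check on a queued task from inside the service environment is Celery's result API. A minimal sketch, assuming access to the same broker and a configured result backend (which this project may not enable):

from celery.result import AsyncResult

# task_id from the API response above
result = AsyncResult('e62c4d06-8101-4074-b3c2-c639005a4430')
print(result.state)  # e.g. PENDING, STARTED, SUCCESS, FAILURE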

Operations

Running

To run the service, fetch the roller and Redis images:

docker pull mozilla/relops-hardware-controller
docker pull redis:3.2

The roller web API and worker both run from the same Docker image.

Copy the example settings file (if you don't have the repo checked out, fetch it with wget https://raw.githubusercontent.com/mozilla-services/relops-hardware-controller/master/.env-dist):

cp .env-dist .env

In production, pass --env ENV_FOO=bar flags instead of using an env file.

Then docker run the containers:

docker run --name roller-redis --expose 6379 -d redis:3.2
docker run -d --name roller-web -p 8000:8000 --link roller-redis:redis --env-file .env mozilla/relops-hardware-controller web
docker run -d --name roller-worker --link roller-redis:redis --env-file .env mozilla/relops-hardware-controller worker

Check that it's running (the unauthenticated request below is rejected, which still shows the API is up and routing requests):

docker ps
CONTAINER ID        IMAGE                                     COMMAND                  CREATED             STATUS              PORTS                    NAMES
f45d4bcc5c3a        mozilla/relops-hardware-controller        "/bin/bash /app/bi..."   3 minutes ago       Up 3 minutes        8000/tcp                 roller-worker
c48a68ad887c        mozilla/relops-hardware-controller        "/bin/bash /app/bi..."   3 minutes ago       Up 3 minutes        0.0.0.0:8000->8000/tcp   roller-web
d1750321c4df        redis:3.2                                 "docker-entrypoint..."   9 minutes ago       Up 8 minutes        6379/tcp                 roller-redis

curl -w '\n' -X POST -H 'Accept: application/json' -H 'Content-Type: application/json' http://localhost:8000/api/v1/workers/tc-worker-1/group/ndc2/jobs\?task_name\=ping
<h1>Bad Request (400)</h1>

docker logs roller-web
[2018-01-10 08:27:23 +0000] [5] [INFO] Starting gunicorn 19.7.1
[2018-01-10 08:27:23 +0000] [5] [INFO] Listening at: http://0.0.0.0:8000 (5)
[2018-01-10 08:27:23 +0000] [5] [INFO] Using worker: egg:meinheld#gunicorn_worker
[2018-01-10 08:27:23 +0000] [8] [INFO] Booting worker with pid: 8
[2018-01-10 08:27:23 +0000] [10] [INFO] Booting worker with pid: 10
[2018-01-10 08:27:23 +0000] [12] [INFO] Booting worker with pid: 12
[2018-01-10 08:27:23 +0000] [13] [INFO] Booting worker with pid: 13
172.17.0.1 - - [10/Jan/2018:08:31:46 +0000] "POST /api/v1/workers/tc-worker-1/group/ndc2/jobs HTTP/1.1" 400 26 "-" "curl/7.43.0"
172.17.0.1 - - [10/Jan/2018:08:31:46 +0000] "- - HTTP/1.0" 0 0 "-" "-"

Configuration

Roller uses an environment variable called DJANGO_CONFIGURATION, defaulting to Prod, to pick which composable configuration class to use.
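
DJANGO_CONFIGURATION is the selector used by the django-configurations library: settings.py defines composable configuration classes, and the env var picks one by name. A simplified sketch (class bodies here are illustrative, not the real settings):

from configurations import Configuration

class Base(Configuration):
    # settings shared by every environment
    TASK_NAMES = []

class Dev(Base):
    DEBUG = True
    TASK_NAMES = ['ping']

class Prod(Base):
    DEBUG = False
    TASK_NAMES = ['reboot']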

In addition to the usual Django, Django REST Framework, and Celery settings, we have:

Web Server Environment Variables
  • TASKCLUSTER_CLIENT_ID The Taskcluster CLIENT_ID to authenticate with

  • TASKCLUSTER_ACCESS_TOKEN The Taskcluster access token to use

Web Server Settings
  • CORS_ORIGIN Which origin to allow CORS requests from (returned in the CORS Access-Control-Allow-Origin header). Defaults to localhost in Dev and tools.taskcluster.net in Prod.

  • TASK_NAMES List of management commands that can be run from the API. Defaults to ping in Dev and reboot in Prod.

Worker Environment Variables
  • BUGZILLA_URL URL for the Bugzilla REST API e.g. https://landfill.bugzilla.org/bugzilla-5.0-branch/rest/

  • BUGZILLA_API_KEY API key for using the Bugzilla REST API

  • XEN_URL URL for the Xen RPC API (see http://xapi-project.github.io/xen-api/usage.html)

  • XEN_USERNAME Username to authenticate with the Xen management server

  • XEN_PASSWORD Password to authenticate with the Xen management server

  • ILO_USERNAME Username to authenticate with the HP iLO management interface

  • ILO_PASSWORD Password to authenticate with the HP iLO management interface

  • FQDN_TO_SSH_FILE Path to the JSON file mapping FQDNs to SSH usernames and key file paths (example in settings.py; default ssh.json). The SSH keys need to be mounted when docker is run, for example with docker run -v host-ssh-keys:.ssh --name roller-worker. The SSH user on the target machine should use ForceCommand to allow only the reboot or shutdown command. A hypothetical sketch of the mapping's shape follows this list.

  • FQDN_TO_IPMI_FILE Path to the JSON file mapping FQDNs to IPMI usernames and passwords (example in settings.py; default ipmi.json)

  • FQDN_TO_PDU_FILE Path to the JSON file mapping FQDNs to PDU SNMP sockets (example in settings.py; default pdus.json)

  • FQDN_TO_XEN_FILE Path to the JSON file mapping FQDNs to Xen VM UUIDs (example in settings.py; default xen.json)

Note: there is an open bug for simplifying the FQDN_TO_* settings.
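
For illustration only, the shape of one of these mappings might look like the following, written as a Python dict (the authoritative format is the example in settings.py; every field name below is an assumption):

# Hypothetical shape of an FQDN_TO_SSH_FILE mapping; see settings.py
# for the real example.
FQDN_TO_SSH = {
    'machine-1.example.mozilla.com': {
        'user': 'reboot-user',
        'key_file': '/home/app/.ssh/machine-1_id_rsa',
    },
}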

Testing Actions

To list available actions/management commands:

docker run --name roller-runner --link roller-redis:redis --env-file .env mozilla/relops-hardware-controller manage.py

Type 'manage.py help <subcommand>' for help on a specific subcommand.

Available subcommands:

[api]
    file_bugzilla_bug
    ilo_reboot
    ipmi_reboot
    ipmitool
    ping
    reboot
    register_tc_actions
    snmp_reboot
    ssh_reboot
    xenapi_reboot

To show help for one:

docker run --link roller-redis:redis --env-file .env mozilla/relops-hardware-controller manage.py ping --help
usage: manage.py ping [-h] [--version] [-v {0,1,2,3}] [--settings SETTINGS]
                      [--pythonpath PYTHONPATH] [--traceback] [--no-color]
                      [-c COUNT] [-w TIMEOUT] [--configuration CONFIGURATION]
                      host

Tries to ICMP ping the host. Raises for exceptions for a lost packet or
timeout.

positional arguments:
  host                  A host

optional arguments:
  -h, --help            show this help message and exit
...
  -c COUNT              stop after sending NUMBER packets
  -w TIMEOUT            stop after N seconds
...

And test it:

docker run --link roller-redis:redis --env-file .env mozilla/relops-hardware-controller manage.py ping -c 4 -w 5 localhost
PING localhost (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.042 ms
64 bytes from localhost (127.0.0.1): icmp_seq=2 ttl=64 time=0.074 ms
64 bytes from localhost (127.0.0.1): icmp_seq=3 ttl=64 time=0.086 ms
64 bytes from localhost (127.0.0.1): icmp_seq=4 ttl=64 time=0.074 ms

--- localhost ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3141ms
rtt min/avg/max/mdev = 0.042/0.069/0.086/0.016 ms

In general, we should be able to run tasks as manage.py commands, and tasks should behave the same whether run as commands or via the API.
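
One common way to get that symmetry (a sketch of the pattern, not necessarily this repository's implementation) is to have each Celery task delegate to the management command of the same name:

from celery import shared_task
from django.core.management import call_command

@shared_task(name='ping')
def ping(host):
    # The API path and the CLI path share one implementation:
    # the task simply invokes the manage.py command.
    call_command('ping', host)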

Note: there is an open bug for not requiring Redis to run management commands.

Adding a new machine or VM

  1. Create an SSH key and user limited to shutdown or reboot with ForceCommand on the target hardware (see the note after this list)
  2. Add the SSH key and user to the mounted worker SSH keys directory
  3. Add the machine's FQDN to the relevant FQDN_TO_* config files
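
For step 1, one common approach (an illustration, not a prescription from this repo) is to restrict the dedicated SSH user on the target machine, either with a ForceCommand /sbin/reboot directive in an sshd_config Match User block, or with a command="/sbin/reboot" prefix on the key's authorized_keys entry, so that the key can only ever reboot the machine.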

Registering Actions with Taskcluster

  1. Check that the TASK_NAMES setting includes only the tasks we want to register with Taskcluster
  2. Check that TASKCLUSTER_CLIENT_ID and TASKCLUSTER_ACCESS_TOKEN are present as env vars or in settings (e.g. via taskcluster-cli login). The client needs the Taskcluster scope queue:declare-provisioner:$provisioner_id#actions
  3. Run:
docker run --link roller-redis:redis --env-file .env mozilla/relops-hardware-controller manage.py register_tc_actions https://roller-dev1.srv.releng.mdc1.mozilla.com my-provisioner-id

Note: An arg like --settings relops_hardware_controller.settings or --configuration Dev may be necessary to use the right Taskcluster credentials

Note: This does not need to be run from the roller server, since the command just sends the given roller URL to Taskcluster when registering the action.

  4. Check that the action shows up in the Taskcluster dashboard for a worker on the provisioner, e.g. https://tools.taskcluster.net/provisioners/my-provisioner-id/worker-types/dummy-worker-type/workers/test-dummy-worker-group/dummy-worker-id (this might require creating a worker)
  5. Run the action from the worker's Taskcluster dashboard

Development

This is similar to prod deployment, but uses make, docker-compose, and env files to simplify starting and running things.

To build and run the web server in development mode, and have the worker reload and purge the queue on file changes, run:

make start-web start-worker

To run tests and watch for changes:

make current-shell  # requires the web server (make start-web) to be running
docker-compose exec web bash
app@ca6a901df6b4:~$ ptw .

Running: py.test .
=========================================================== test session starts ============================================================
platform linux -- Python 3.6.3, pytest-3.3.2, py-1.5.2, pluggy-0.6.0
Django settings: relops_hardware_controller.settings (from environment variable)
rootdir: /app, inifile: pytest.ini
plugins: flake8-0.9.1, django-3.1.2, celery-4.1.0
collected 74 items
...

Adding a HW management task

  1. Create relops_hardware_controller/api/management/commands/<command_name>.py and tests/test_<command_name>_command.py, e.g. ping.py and test_ping_command.py (a minimal skeleton is sketched after this list)
  2. Run make shell, then ./manage.py, and check for the command in the [api] section of the output
  3. Add the command name to TASK_NAMES in relops_hardware_controller/settings.py to make it accessible via the API
  4. Add any required shared secrets, like SSH keys, to settings.py or .env-dist
  5. Register the action with Taskcluster (see Registering Actions with Taskcluster above)
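
As a starting point for step 1, a minimal command skeleton might look like this (the file name, argument, and body are placeholders):

# relops_hardware_controller/api/management/commands/example_task.py
from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = 'One-line description shown by manage.py help.'

    def add_arguments(self, parser):
        parser.add_argument('host', help='A host FQDN to act on')

    def handle(self, *args, **options):
        # Real commands would ping/reboot/etc. the host here.
        self.stdout.write('acting on %s' % options['host'])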