Objective: Deploy a Docker Swarm from scratch on AWS where
- there is persistent storage via GlusterFS.
- there is secrets management via Vault.
- Connect to the newly created AWS account and output basic information such as availability zones.
- Provision two Ubuntu EC2 instances in the default VPC with Terraform.
- Configure the two Ubuntu machines with Ansible so that they run as a Swarm cluster.
- Be able to deploy a hello-world service and Swarm Visualizer on the swarm.
- Provision a separate GlusterFS as underlying data store.
- Make sure that the system is resilient to taking arbitrary nodes down.
- It must be possible to scale up by simply adding a new Swarm node (e.g. an EC2 instance).
- It must be possible to scale down by removing a Swarm node.
- It must be possible to incrementally upgrade the size of Swarm nodes and/or the Gluster bricks.
- Provision a docker container that can simulate backup of files from a whitelist.
$ cd terraform
$ export AWS_PROFILE=lundogbendsen
$ terraform apply
Copy the cluster-nodes from the Terraform output to ../ansible/inventory.ini, with the first host as master and the rest as workers:
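You can re-print the host list at any time (assuming the Terraform output is actually named cluster-nodes):
$ terraform output cluster-nodes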
$ cd ../ansible
$ edit inventory.ini
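The result could look roughly like this (the staging-master group name and the ansible_user setting are assumptions; the worker group names match the ones used by the playbooks below):
[staging-master]
ec2-xx-xx-xx-xx.eu-west-1.compute.amazonaws.com ansible_user=ubuntu

[staging-workers]
ec2-yy-yy-yy-yy.eu-west-1.compute.amazonaws.com ansible_user=ubuntu

[staging-new-workers]
# left empty until you scale up (see below)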
To avoid specifying the key all the time, put the location of the private key in ansible/ansible.cfg, e.g.:
[defaults]
private_key_file = /Users/jps/.ssh/ec2-lundogbendsen-jp.pem
Prepare shared data storage:
$ export ANSIBLE_HOST_KEY_CHECKING=False
$ ansible-playbook -i inventory.ini staging-01-prepare-shared-data-storage.yml
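For orientation, the Gluster side of what these playbooks automate corresponds roughly to the following manual steps (how the work is split between staging-01 and the later playbooks may differ; the /mnt/swarm mount point is an assumption, the brick path is the one referenced later in this runbook):
# on every node: create the brick directory
$ sudo mkdir -p /data/gluster/swarm/brick0
# on the master: form the trusted storage pool, then create and start the replicated volume
$ sudo gluster peer probe ec2-yy-yy-yy-yy.eu-west-1.compute.amazonaws.com
$ sudo gluster volume create swarm replica 2 \
    ec2-xx-xx-xx-xx.eu-west-1.compute.amazonaws.com:/data/gluster/swarm/brick0 \
    ec2-yy-yy-yy-yy.eu-west-1.compute.amazonaws.com:/data/gluster/swarm/brick0 force
$ sudo gluster volume start swarm
# on every node: mount the volume (mount point assumed)
$ sudo mount -t glusterfs localhost:/swarm /mnt/swarm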
Connect the nodes into a GlusterFS Trusted Storage Pool and bring up the Swarm cluster. First create the Swarm master:
$ ansible-playbook -i inventory.ini staging-11-create-swarm-master.yml
Copy the token and IP from the output into group_vars/staging-workers. Then start all workers:
$ ansible-playbook -i inventory.ini staging-12-create-swarm-workers.yml
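Under the hood these two playbooks presumably wrap the standard Swarm commands; the token and IP you copy come from the init output:
# on the master (roughly what staging-11 automates)
$ docker swarm init --advertise-addr <master-private-ip>
# the output contains a ready-made join command of the form:
# docker swarm join --token SWMTKN-1-xxxx <master-private-ip>:2377
# on each worker (roughly what staging-12 automates)
$ docker swarm join --token SWMTKN-1-xxxx <master-private-ip>:2377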
To see that it is working, make an SSH tunnel (in a new terminal) to one of the nodes:
$ ssh -i ~/.ssh/ec2-lundogbendsen-jp.pem -N -L 2375:/var/run/docker.sock ubuntu@ec2-xx-xx-xx-xx.eu-west-1.compute.amazonaws.com
And check that the cluster is running:
$ cd ..
$ export DOCKER_HOST=tcp://localhost:2375
$ docker node ls
Deploy the test stack:
$ docker stack deploy -c docker-compose.yml mytest
And remove it again when done:
$ docker stack rm mytest
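docker-compose.yml is expected to define the hello-world service and the Swarm Visualizer from the objectives; a minimal stack file along those lines could look like this (image choices and published ports are illustrative, not necessarily what the repo ships):
version: "3.3"
services:
  hello:
    image: nginxdemos/hello          # illustrative hello-world web service
    ports:
      - "8080:80"
    deploy:
      replicas: 2
  visualizer:
    image: dockersamples/visualizer
    ports:
      - "8081:8080"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    deploy:
      placement:
        constraints:
          - node.role == manager
Thanks to the ingress routing mesh, the published ports are then reachable on any node in the swarm.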
Increase the number of nodes in terraform/variables.tf and run terraform plan to verify that everything looks ok. Then run terraform apply to actually create the new instance.
Add the IP of the newly created instance to the ansible/inventory.ini group staging-new-workers (it should be empty initially):
$ cd ../ansible
$ edit inventory.ini
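For example, with the new instance's hostname (ansible_user is the same assumption as above):
[staging-new-workers]
ec2-zz-zz-zz-zz.eu-west-1.compute.amazonaws.com ansible_user=ubuntu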
Then provision only the new instance:
$ ansible-playbook -i inventory.ini -l staging-new-workers staging-01-prepare-shared-data-storage.yml
$ ansible-playbook -i inventory.ini staging-11-add-worker-to-existing-swarm.yml
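If that playbook also extends the Gluster volume onto the new node, the underlying command is along these lines (the replica count must equal the new total number of bricks; this is a sketch, not necessarily what the playbook runs):
$ sudo gluster volume add-brick swarm replica 3 ec2-zz-zz-zz-zz.eu-west-1.compute.amazonaws.com:/data/gluster/swarm/brick0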
Now that the new host is ready to be a worker, copy it in ansible/inventory.ini from group staging-new-workers to group staging-workers, and add it (only) to the swarm:
$ ansible-playbook -i inventory.ini -l staging-new-workers staging-12-create-swarm-workers.yml
Finally, remove the new host from the ansible/inventory.ini group staging-new-workers.
Start by changing the Terraform spec and do a terraform plan to find out which host would be taken down if applied. Then use that host in the following steps.
First drain the swarm node:
$ docker node ls
$ docker node update --availability drain 0hl18vnlus36szv9cvafpiwze
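To check that no tasks are left on the drained node:
$ docker node ps 0hl18vnlus36szv9cvafpiwze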
Then SSH into that particular node and make it leave the swarm:
$ ssh ...
$ docker swarm leave
$ exit
Remove the stopped node from the swarm:
$ docker node rm 0hl18vnlus36szv9cvafpiwze
Then, remove the brick from the Gluster volume:
$ ssh ...
$ sudo gluster volume remove-brick swarm replica 2 ec2-34-244-9-221.eu-west-1.compute.amazonaws.com:/data/gluster/swarm/brick0 force
Finally take the server down.
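A sketch of that teardown, with an optional Gluster peer detach that is not part of the original runbook:
$ ssh ...
$ sudo gluster peer detach ec2-34-244-9-221.eu-west-1.compute.amazonaws.com
$ exit
Then apply the reduced Terraform spec from the terraform directory:
$ terraform apply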
- On each host: sudo touch /var/lib/cloud/instance/locale-check.skip
- Only open the following ports (a ufw sketch follows after this list):
- 7946 TCP/UDP for container network discovery.
- 4789 UDP for the container ingress network.
- 49152 TCP for GlusterFS. See https://www.jamescoyle.net/how-to/457-glusterfs-firewall-rules
- How to bypass host fingerprint warning from Ansible? ANSIBLE_HOST_KEY_CHECKING=False
- Check out ansible --with-registry-auth.
- Ansible role for common Apt stuff: HTTPS, update.
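If you prefer to enforce the port list above on the hosts as well (not only in the AWS security group), a ufw sketch would be as follows; note that Swarm management traffic additionally uses 2377/tcp to the manager:
$ sudo ufw allow 7946/tcp
$ sudo ufw allow 7946/udp
$ sudo ufw allow 4789/udp
$ sudo ufw allow 49152/tcp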
To see the state of services:
$ docker service ls
To see full error messages:
$ docker service ps --no-trunc mytest_visualizer
To inspect the GlusterFS pool and volumes:
$ sudo gluster peer status
$ sudo gluster volume info myvol
$ sudo gluster volume status myvol
$ sudo gluster volume status myvol detail
$ sudo gluster volume heal myvol info
To force-start a volume that is down:
$ sudo gluster volume start swarm force
"Already part of a volume": https://www.jamescoyle.net/how-to/2234-glusterfs-error-volume-add-brick-failed-pre-validation-failed-on-brick-is-already-part-of-a-volume