If you can't see your issue on this page, come ask us about it on our Community Slack or create an issue.
Control Tower uses BOSH to deploy and manage VMs. When something isn't working right but the cause isn't obvious, the best general first steps are:
-
Export bosh credentials
eval "$(control-tower info --iaas [AWS|GCP] --env <deployment-name>)"
-
Check the status of the deployed VMs
bosh vms
Which gives an output like:
Using environment '1.2.3.4' as client 'admin'

Task 9312. Done

Deployment 'concourse'

Instance                                     Process State  AZ  IPs       VM CID               VM Type               Active  Stemcell
web/95589e21-09af-412d-abef-a2065fa828fe     running        z1  1.2.3.4   i-00000000000000000  concourse-web-xlarge  true    bosh-aws-xen-hvm-ubuntu-bionic-go_agent/1.67
                                                                10.0.0.8
worker/17cedb77-a924-4e09-bb1a-952b7e8b3fc6  failing        z1  10.0.1.7  i-00000000000000000  concourse-2xlarge     true    bosh-aws-xen-hvm-ubuntu-bionic-go_agent/1.67

2 vms

Succeeded
Look for VMs that aren't in the running state.
-
If you see a VM in a non-running state you can ssh to it with
bosh ssh worker/17cedb77-a924-4e09-bb1a-952b7e8b3fc6
-
Once on the VM, become root (you can't do much without it) and check the state of all the processes (BOSH uses monit to manage processes):
sudo -i
monit summary
The Monit daemon 5.2.5 uptime: 5d 12h 12m
Process 'worker' running
Process 'telegraf' failing
Process 'telegraf-agent' running
System 'system_a85416b7-ea5c-4889-bed1-20ce16cef76e' running
-
If you see a process that is erroring you can find logs for it (and all other processes) in
/var/vcap/sys/log/<process-name>
-
You can manually restart processes with
monit restart <process-name>
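The checks in the steps above can be scripted when you have many VMs or processes to scan. The following is a minimal sketch, not part of Control Tower or BOSH: the helper names `non_running_vms` and `failing_process_logs` are made up for illustration, and both assume the plain-text output formats shown above.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Control Tower or BOSH).

# Reads `bosh vms` output on stdin and prints every instance whose
# process state is anything other than "running".
# Assumes instance rows start with "<group>/<id>" followed by the state.
non_running_vms() {
  awk 'index($1, "/") > 0 && $2 != "running" { print $1 }'
}

# Reads `monit summary` output on stdin and prints the log directory of
# every process whose status does not start with "running".
failing_process_logs() {
  awk -v q="'" '$1 == "Process" && $3 != "running" {
    name = $2
    gsub(q, "", name)   # strip the quotes monit puts around the name
    print "/var/vcap/sys/log/" name
  }'
}
```

After exporting the BOSH credentials, `bosh vms | non_running_vms` would list the instances worth connecting to with `bosh ssh`. On the VM, as root, `monit summary | failing_process_logs` would print the log directories to inspect first.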
Some other useful bosh commands are:
bosh tasks --all --recent
shows recent tasks including system ones - can show if a VM is flapping and BOSH keeps trying to restart it
You can read more about BOSH troubleshooting in their own documentation.
NATS handles communication between the director VM and the bosh-agent processes that run on each VM that it manages (web and worker(s)). When the NATS certificate expires, this communication is no longer possible and any running VMs will appear as unresponsive agent in bosh vms.
You can check the expiry of the NATS certs on your Control Tower deployment with:
control-tower info --iaas <AWS|GCP> --region <region> --cert-expiry <deployment-name>
and if it is getting close to expiry you can rotate it with the maintain command.
If the certificate has already expired you will see an error when deploying which resembles:
Deploying:
Creating instance 'bosh/0':
Waiting until instance is ready:
Post https://mbus:<redacted>@<IP>:6868/agent: x509: certificate has expired or is not yet valid
Exit code 1
Solution:
-
Download director-creds.yml from the config bucket of your deployment (in S3 or GCS, depending on your IaaS)
-
Delete all the certs in that file (more info)
Note that each certificate will contain keys for ca, private_key, and certificate. You need to delete all three keys for each certificate.
-
Overwrite the director-creds.yml in your bucket with your newly modified one
-
Run
control-tower deploy
to force BOSH to generate new certs. Note that the Concourse deploy will fail and all the VMs will appear in BOSH as unresponsive agent.
-
Export bosh credentials with
eval "$(control-tower info --iaas [AWS|GCP] --env <deployment-name>)"
-
Run
bosh deploy --recreate --fix <(bosh manifest)
to push the new NATS cert to each VM
-
Run
control-tower deploy
which should now run all the way through
-
Optionally run the renew-https-cert job in the control-tower-self-update pipeline in your main team to renew the outward-facing SSL cert
Further information can be found in the BOSH docs.
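Before editing director-creds.yml by hand, it can help to list which entries are certificates. The sketch below is a hypothetical helper, not part of Control Tower. It assumes each credential is a top-level key and that certificate entries carry a two-space-indented `certificate:` child, as described in the steps above. Verify that against your own file before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: lists the top-level entries of a creds file that
# look like certificates, i.e. that have an indented "certificate:" key.
# Assumes two-space indentation and one top-level key per credential.
list_cert_entries() {
  awk '
    # Remember the most recent top-level key.
    /^[A-Za-z0-9_-]+:/ { name = $0; sub(/:.*$/, "", name) }
    # A "certificate:" child means the current entry is a certificate.
    /^  certificate:/ && name != "" { print name }
  ' "$1"
}
```

Running `list_cert_entries director-creds.yml` prints one entry name per line. The helper only reads the file, so the deletion itself remains a manual step.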
If the Director certificate (used by the Director API endpoint) has expired, you'll see an error when interacting with control-tower which resembles:
Succeeded
Fetching info:
Performing request GET 'https://<redacted>:25555/info':
Performing GET request:
Retry: Get https://<redacted>:25555/info: x509: certificate has expired or is not yet valid
Exit code 1
exit status 1
You can check the certificate expiry dates using the following command:
echo | openssl s_client -showcerts -connect <director-ip>:25555 | openssl x509 -noout -text
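To get a yes/no answer instead of reading the dates by eye, `openssl x509` also has a `-checkend` option. The following is a minimal sketch; the function name `cert_expires_within_days` is an assumption made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads a PEM certificate on stdin and succeeds
# (exit 0) if it expires within the given number of days.
# Note: unreadable input is also reported as "expiring".
cert_expires_within_days() {
  seconds=$(( $1 * 86400 ))
  # `openssl x509 -checkend N` exits 0 when the certificate is still valid
  # N seconds from now, so the result is inverted here.
  ! openssl x509 -noout -checkend "$seconds" >/dev/null 2>&1
}
```

For example, `echo | openssl s_client -connect <director-ip>:25555 2>/dev/null | cert_expires_within_days 30 && echo "rotate soon"` would print a reminder when the Director certificate has fewer than 30 days left.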
Solution:
- Download config.json from the config bucket of your deployment (in S3 or GCS, depending on your IaaS), whose name should resemble control-tower-<deployment>-<region>-config
- Delete the director_ca_cert, director_cert and director_key from the config.json file.
- Overwrite the config.json in your bucket with your newly modified one
- Run control-tower deploy to force BOSH to generate new certs:
e.g.
control-tower deploy --iaas <AWS or GCP> --region <region> <deployment>
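The deletion step above can be done with jq rather than by hand. This is a sketch under two assumptions that the steps do not confirm: that jq is installed, and that the three fields are top-level keys in config.json. The function name `strip_director_certs` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: prints config.json with the three Director
# certificate fields removed. Assumes they are top-level keys and that
# jq is installed. The input file is left untouched.
strip_director_certs() {
  jq 'del(.director_ca_cert, .director_cert, .director_key)' "$1"
}
```

A cautious way to use it is `strip_director_certs config.json > config.new.json`, then compare the two files before overwriting the copy in your bucket.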
Once the certificate has been regenerated and deployed, you can check with the following command:
echo | openssl s_client -showcerts -connect <director-ip>:25555 | openssl x509 -noout -text