There are a four basic deployment scenarios that are supported by this repository:
- in the first scenario (the default scenario if the playbook in this repository is used as is), we'll walk through the deployment of a multi-node Cassandra cluster in an AWS environment
- in the second scenario, we'll discuss how the configuration used in the first scenario can be modified to perform the same deployment to an OpenStack environment instead
- in the third scenario, we will walk through the process of deploying Cassandra in an AWS or OpenStack environment where the target nodes have already been created and tagged. The command used will essentially be the same as those used in the first two scenarios, but some of the parameters from the vars/cassandra.yml and/or config.yml files may be ignored. We will discuss this further when describing this third scenario
- finally, in the fourth scenario, we will describe how a static inventory file can be used to control the playbook run, targeting the nodes specified in the
cassandra
andcassandra_seed
host groups defined in that static inventory file. While this deployment scenario is supported by the playbook in this repository, it is only supported for non-cloud environments (i.e. scenarios where we are not deploying to an AWS or OpenStack environment).
In the first two scenarios, the virtual machines needed for the multi-node Cassandra deployment will be created and the OS deployed to each of those instances will be configured using the input variables and configuration files. This configuration process (referred to as the preflight
process in the playbook run) involves placing each instance on the proper networks, building a filesystem on the attached data volume (if a data volume was defined), and mounting that data volume on the appropriate mount point. Once that preflight process is complete, the playbook then deploys Cassandra to each of the target nodes and configures those nodes as a multi-node Cassandra cluster.
The default behavior when deploying to either an AWS or OpenStack environment is to control the deployment using a dynamic inventory that is built from the matching nodes in that target environment (i.e. the nodes that have been tagged with tags that match the tag values input as part of the playbook run).
In this scenario, we show how a Cassandra cluster can be built in an AWS environment where a matching set of nodes does not yet exist. In this scenario, the provision-cassandra.yml playbook will actually perform several tasks:
- create the target nodes and build the Cassandra and Cassandra Seed host groups: since the target nodes do not exist yet, the playbook will use the aws role to create a new set of nodes that will be targeted by the remainder of the playbook run; once these nodes are created, the host groups containing the information needed to connect to those nodes are created using the build-app-host-groups role
- configure those target nodes: once created, the target nodes must be configured properly based on the input parameters to the playbook run; this can include placing them on the proper subnet (or subnets), creating a data volume if one is defined, adding a filesystem to that data volume, and mounting that data volume in the defined endpoint; all of these tasks are accomplished by calling out to the preflight role
- deploying Cassandra to the target nodes (first to the seed nodes, then to the non-seed nodes: once the target nodes have been created and configured properly, the playbook then deploys Cassandra to those target nodes and configures them as a multi-node Cassandra cluster
These tasks are actually handled in a single command, where the tasks outlined above are handled in four separate plays within the provision-cassandra.yml playbook. The size of the cluster that will be built is governed by the node_map
parameter, which has a value defined in the default configuration file (the vars/cassandra.yml file) that looks like this:
node_map:
- { application: cassandra, role: seed, count: 3 }
- { application: cassandra, count: 3 }
So, by default, the playbook in this repository will build a six-node, multi-node Cassandra cluster (with three seed nodes and three non-seed nodes). Of course there may be multiple clusters or ensembles in the target environment, so we also need a way to identify which nodes should be targeted by our playbook run. This is accomplished by setting a series of values for the cloud
, tenant
, project
, dataflow
, domain
, and cluster
tags. In the default configuration file used by the provision-cassandra.yml playbook (the config.yml file), these parameters are defined as follows:
cloud: aws
tenant: datanexus
project: demo
domain: production
the tags shown here must be provided with a value for all deployments; the dataflow
and cluster
tags (which are not included in the default configuration file) will assume their default values (none
and a
, respectively) if a value is not provided during any given playbook run.
In addition to modifying the values for these tags (and providing values for the dataflow
and/or cluster
tags), this same configuration file can be used to customize the values for any of the other parameters that are used during the deployment run. In this example, let's assume that we want to customize the configuration in order to:
- set values for the
cluster
anddataflow
tags used when creating new instances in our AWS environment (toa
andpipeline
, respectively) - customize the
node_map
parameter used during the playbook run to deploy an eight node cluster instead of a six node cluster - change the
type
used for the nodes we're deploying in our AWS environment so thatt2.small
nodes are used (the default is to uset2.large
nodes) - define a specific
image
to use when creating new nodes (theami-f4533694
image in this example; keep in mind that in an AWS environment, these image IDs are regional, so the value used in one region will not work in another; it should also be noted here that if this value is not specified in an AWS deployment, the playbook will search for an appropriate image to use based on the defined region) - define non-default sizes for the root and data volumes
- shrink the maximum heap size of the JVM used to run Cassandra on each node to 1GB so that it can run in a
t2.small
instance and optimize it for storage using spinning disks (the default is to optimize for storage using SSDs)
To make these changes, let's assume that we create a new configuration file, let's call it the config-aws-custom.yml
file, containing definitions for all of these new parameter values. The resulting file will look something like this:
$ cat config-aws-custom.yml
---
# cloud type and VM tags
cloud: aws
tenant: datanexus
project: demo
domain: production
dataflow: pipeline
cluster: a
# define a custom node_map value, increasing the node
# count to seven nodes (from the default value of five)
node_map:
- { application: cassandra, role: seed, count: 3 }
- { application: cassandra, count: 5 }
# network configuration used when creating VMs
cidr_block: '10.10.0.0/16'
internal_subnet: '10.10.1.0/24'
external_subnet: '10.10.2.0/24'
# username used to access the system via SSH
user: 'centos'
# AWS-specific parameters
region: us-west-2
# here we override the default value for the 'type'
# and specify an image (the 'image' value defined here will only
# work for the 'us-west-2' region)
type: 't2.small'
image: ami-f4533694
# these two parameters can be used to set a size (in gigabytes) for
# the root and data volumes, respectively
root_volume: 11
data_volume: 100
# custom Cassandra configuration parameters
cassandra_jvm_heaps_size: "1G"
cassandra_disk_optimization_strategy: spinning
$
Notice that we have redefined many of the same parameters that are defined in the default configuration file in this custom configuration file. This is necessary because this file will be used instead of the default configuration file.
Now that we've created a custom configuration file for our deployment, we can use that file to create a new (seven-node) Cassandra cluster in the defined AWS target environment by running a command that looks something like this:
$ AWS_PROFILE=datanexus_west ansible-playbook -e "{ \
config_file: 'config-aws-custom.yml' \
}" provision-cassandra.yml
This process of modifying the default behavior of the playbook run is discussed further (and in more detail) in this document.
It should also be noted that if there are already a set of tagged nodes running in the target environment that match the set of tags passed into the playbook run, then the configuration parameters associated with creating nodes in the defined cloud (the node_map
, type
, image
, cidr_block
, internal_subnet
, external_subnet
, root_volume
, and data_volume
) will be silently ignored and the matching nodes that were found (based on the tags that were passed in) will be targeted for provisioning as a new Cassandra cluster rather than creating a new set of nodes. As such, care should be taken when defining the set of tags that are passed into the playbook run (the tenant
, project
, dataflow
, domain
, and cluster
tags) to ensure that the combination of tags used provide a unique identifier for the Cassandra cluster that is being created in the targeted AWS environment.
As an aside, it should be noted here that the provision-cassandra.yml playbook includes a shebang line at the beginning of the playbook file. As such, the playbook can be executed directly (as if it were a shell script instead of a YAML file) rather than using the file as the final argument in an ansible-playbook
command. This means that the same command that was shown above could be run like this:
$ AWS_PROFILE=datanexus_west ./provision-cassandra.yml -e "{ \
config_file: 'config-aws-custom.yml' \
}"
This form is available as a replacement for any of the ansible-playbook
commands that we show in any of our application playbooks. In most of the examples shown in this documentation, the second form will be used rather than the first; which form you choose to is really just a matter of personal preference (since both accomplish the same thing).
This scenario is basically the same as the previous scenario; the only difference is in the set of parameters that must be defined for this playbook run to succeed. In addition to the parameters that were shown in the custom configuration file we used in our AWS deployment example (above), values must be provided for a few additional parameters that are only needed when creating and configuring the VMs for a new Cassandra cluster in an OpenStack environment. Here's an example of what the resulting (OpenStack) custom configuration file might look like:
$ cat config-osp-custom.yml
---
# cloud type and VM tags
cloud: osp
tenant: datanexus
project: demo
domain: production
dataflow: pipeline
cluster: a
# define a custom node_map value, increasing the node
# count to eight nodes (from the default value of six)
node_map:
- { application: cassandra, role: seed, count: 3 }
- { application: cassandra, count: 5 }
# configuration (internal) and accessible (external) networks
internal_uuid: 849196dc-12e9-4d50-ac52-2e753d03d13f
internal_subnet: 10.10.0.0/24
external_uuid: 92321cab-1a9c-49d6-99a1-62eedbbc2171
external_subnet: 10.10.3.0/24
# the name of the float_pool that the floating IPs assigned to
# the external subnet should be pulled from
float_pool: external
# username used to access the system via SSH
user: cloud-user
# a few configuration options related to the OpenStack
# environment we are targeting
region: region1
zone: az1
# the node type and image to use for our deployment
type: prod.small
image: a36db534-5f9d-4a9f-bdd4-8726cc613b4e
# optional; used to define the size of the root and data
# volumes for each node (in gigabytes)
root_volume: 11
data_volume: 100
# the name of the block_pool that the data volumes should be
# allocated from; required if the data_volume parameter is
# defined (as is the case in this example)
block_pool: prod
# custom Cassandra configuration parameters
cassandra_jvm_heaps_size: "1G"
cassandra_disk_optimization_strategy: spinning
As you can see, this file differs slightly from the configuration file that was shown previously (in the AWS deployment example that was shown above); specifically, in this example configuration file:
- the
cloud
parameter has been set toosp
instead ofaws
- the
region
parameter has been set to a value ofregion1
; this is an example only, and the value that you should use for this parameter will vary depending on the specific OpenStack environment you are targeting (talk to your OpenStack administrator to obtain the value you should use for this parameter) - a
zone
parameter has been added and set toaz1
; this is an example only, and the value that you should use for this parameter will vary depending on the specific OpenStack environment you are targeting (talk to your OpenStack administrator to obtain the value that you should use for this parameter) - the
user
parameter has been set to a value ofcloud-user
; this is an example only, and the value that you should use for this parameter will vary depending on the specific OpenStack environment you are targeting (talk to your OpenStack administrator to obtain the value that you should use for this parameter) - the
image
parameter has been set to the UUID of the image that should be used for our deployments; this value will need to be changed to reflect your OpenStack deployment since it must match one of on the images available to you in that environment (it is assumed to point to a CentOS7 or RHEL7 image) - in addition to the
internal_subnet
andexternal_subnet
parameters that we defined in the AWS deployment example (above), a corresponding UUID must be specified that identifies the network that each of those subnets are associated with (theinternal_uuid
andexternal_uuid
values, respectively); these UUID values were not necessary for AWS deployments but they are needed for an OpenStack deployment - a
float_pool
value must correspond to the name of the pool that the floating IP addresses assigned to each of the instances we are creating should be drawn from. - if a
data_volume
value is provided (as is the case in this example), a correspondingblock_pool
parameter must also be defined that specifies the name of the block pool that those data volumes should be allocated from
It should be noted that while the zone
parameter shown in this OSP deployment example was not specified in our AWS deployment example, the machines that were created in the AWS deployment example were definitely deployed to a specific availability zone within the VPC targeted by our playbook run. That's because the availability zone being targeted was inferred from the subnet information that was passed into the playbook run in that AWS deployment example. In OpenStack, the modules we are using do not allow us to make that inference, so the availability zone we are targeting must be specified for our playbook run to succeed.
Once that custom configuration file has been created, the command to create the requested set of VMs, configure them on the right networks, attach the requested data volumes, and configure them as a multi-node Cassandra cluster is actually quite simple (assuming that you have already configured your local shell so that you have command-line access to the targeted OpenStack environment):
$ ./provision-cassandra.yml -e "{ \
config_file: 'config-osp-custom.yml' \
}"
This should result in the deployment of an eight-node Cassandra cluster in the targeted OpenStack region and availability zone (again, provided that an existing set of VMs that matched the input tenant
, project
, dataflow
, domain
, application
, and cluster
tags were not found in the targeted OpenStack environment).
In this section we will repeat the multi-node cluster deployments that were just shown in the previous two scenarios, but with a small difference. In this scenario, we will assume that a set of VMs already exists in the targeted AWS or OpenStack environment that we would like to use for our Cassandra cluster. This makes the configuration file that we will use much simpler, since the hard work of actually building and configuring the VMs (placing them on the right network, attaching the necessary data volume, etc.) has already been done. The playbook run will be reduced to the following two tasks:
- building the Cassandra and Cassandra Seed host groups: these host groups will be built dynamically from the inventory available in the targeted AWS or OpenStack environment, based on the tags that are passed into the playbook run
- deploying Cassandra to the target nodes: once the matching nodes have been identified, the playbook deploys Cassandra to those nodes and configures them as a multi-node Cassandra cluster
To accomplish this, the we have to:
- tag the existing VMs that we will be using to build our new Cassandra cluster with the appropriate
Cloud
,Tenant
,Project
,Dataflow
,Domain
,Application
, andCluster
tags - place those VMs on the appropriate subnets (i.e. the
10.10.1.0/24
and10.10.2.0/24
subnets) in the appropriate VPC (a VPC with a10.10.0.0/16
CIDR block) in theus-west-2
EC2 region - run a playbook command similar to the commands shown in the previous two scenarios (depending on whether the target environment is an AWS or OpenStack environment); this will deploy Cassandra to the target nodes and configure them as a multi-node Cassandra cluster.
In terms of what the command looks like, lets assume for this example that we've tagged our target nodes in an AWS with the following VM tags:
- Cloud: aws
- Tenant: datanexus
- Project: demo
- Dataflow: none
- Domain: production
- Application: cassandra
- Cluster: a
The tags shown in this example have been chosen specifically to match the tags in the default configuration file (the config.yml file), which looks like this:
$ cat config.yml
# (c) 2017 DataNexus Inc. All Rights Reserved
#
# Defaults used if a configuration file is not passed into the playbook run
# using the `config_file` extra variable; deploys to an AWS environment
---
# cloud type and VM tags
cloud: aws
tenant: datanexus
project: demo
domain: production
# network configuration used when creating VMs
cidr_block: '10.10.0.0/16'
internal_subnet: '10.10.1.0/24'
external_subnet: '10.10.2.0/24'
# username used to access the system via SSH
user: 'centos'
# AWS-specific parameters
region: us-west-2
# node type and image ID to deploy; note that the image ID is optional for AWS
# deployments but mandatory for OSP deployments; if left undefined in an AWS
# deployment then then the playbook will search for an appropriate image to use
type: 't2.large'
#image: 'ami-f4533694'
# the default data and root volume sizes
root_volume: 11
data_volume: 40
and the network configuration and EC2 region mentioned earlier were chosen to match the network configuration and EC2 region that are defined as default values in this file. If you want to override these values in your own playbook run (to target machines on a different set of subnets, in a different VPC, and/or in a different region) then those parameters should be added to a custom configuration file and that configuration file should be used instead of the default configuration file shown above.
Assuming that we have created our VMs in the us-west-2
EC2 region in the proper VPC, placed them on the proper subnets within that VPC, and tagged them appropriately, then the command needed to deploy Cassandra to those nodes would be quite simple:
$ AWS_PROFILE=datanexus_west provision-cassandra.yml
This command would read in the tags from the default configuration file (shown above), find the matching VMs in the us-west-2
EC2 region, and use those machines as the target nodes for the Cassandra deployment (deploying Cassandra to those nodes and configuring them as a multi-node Cassandra cluster).
The playbook command used to deploy Cassandra to a set of target nodes in an OpenStack environment would look quite similar:
$ provision-cassandra.yml -e "{ \
config_file: config-osp.yml \
}"
where the configuration file we are passing in would look nearly the same as the default configuration file used in the AWS deployment:
$ cat config-osp.yml
---
# cloud type and VM tags
cloud: osp
tenant: datanexus
project: demo
domain: production
# configuration (internal) and accessible (external) networks
internal_uuid: 849196dc-12e9-4d50-ac52-2e753d03d13f
internal_subnet: 10.10.0.0/24
external_uuid: 92321cab-1a9c-49d6-99a1-62eedbbc2171
external_subnet: 10.10.3.0/24
# the name of the float_pool that the floating IPs assigned to
# the external subnet should be pulled from
float_pool: external
# username used to access the system via SSH
user: cloud-user
# a few configuration options related to the OpenStack
# environment we are targeting
region: region1
zone: az1
# the node type and image to use for our deployment
type: prod.small
image: a36db534-5f9d-4a9f-bdd4-8726cc613b4e
# optional; used to define the size of the root and data
# volumes for each node (in gigabytes)
root_volume: 11
data_volume: 100
# the name of the block_pool that the data volumes should be
# allocated from; required if the data_volume parameter is
# defined (as is the case in this example)
block_pool: prod
# path to a directory containing the private keys for this deployment
private_key_path: '/tmp/keys'
As you can see, this file doesn't differ significantly from the default configuration file that was shown previously, other than:
- changing the value of the
cloud
parameter toosp
fromaws
- modifying the default value for the
user
parameter fromcentos
tocloud-user
; the value shown here is meant to be an example, and the actual value for this parameter may differ in your specific OpenStack deployment (check with your OpenStack administrator to see what value you should be using) - adding an image UUID (to identify the image that should be used for the deployment); this parameter is required for OpenStack deployments, since there is no way to search for an appropriate image to use if it is not specified
- adding a few OpenStack-specific parameters to the configuration file (the
internal_uuid
,external_uuid
,float_pool
, andblock_pool
values) - setting the
private_key_path
to a directory that contains the keys needed to access instances in your OpenStack environment; by default this parameter is set to the playbook directory, so if you keep the keys needed to SSH into the target instances in your OpenStack environment in the top-level directory of the repository this parameter is not necessary
As was the case with the AWS deployment example shown previously, we are assuming here that a set of nodes already exists in the targeted OpenStack environment and that those nodes have been tagged appropriately. The matching set of nodes will then be used to construct the Cassandra host group, and that set of nodes will be targeted by the playbook run.
If you are using this playbook to deploy a multi-node Cassandra cluster to a non-cloud environment (i.e. an environment other than an AWS or OpenStack environment), then a static inventory file must be passed into the playbook to control which nodes will be targeted by the playbook run and to define how Ansible should connect to those nodes. In this example, let's assume that we are targeting six nodes in our playbook run and that those nodes are already up and running with the IP addresses 192.168.34.72, 192.168.34.73, 192.168.34.74, 192.168.34.75, 192.168.34.76, and 192.168.34.77. Furthermore, let's assume that we would like to setup the first three nodes as the seed nodes for our cluster and the remaining nodes as the non-seed nodes. In that case, the static inventory file that we will be using might look something like this:
$ cat test-cluster-inventory
# example inventory file for a clustered deployment
192.168.34.72 ansible_ssh_host=192.168.34.72 ansible_ssh_port=22 ansible_ssh_user='cloud-user' ansible_ssh_private_key_file='keys/test_cluster_private_key'
192.168.34.73 ansible_ssh_host=192.168.34.73 ansible_ssh_port=22 ansible_ssh_user='cloud-user' ansible_ssh_private_key_file='keys/test_cluster_private_key'
192.168.34.74 ansible_ssh_host=192.168.34.74 ansible_ssh_port=22 ansible_ssh_user='cloud-user' ansible_ssh_private_key_file='keys/test_cluster_private_key'
192.168.34.75 ansible_ssh_host=192.168.34.75 ansible_ssh_port=22 ansible_ssh_user='cloud-user' ansible_ssh_private_key_file='keys/test_cluster_private_key'
192.168.34.76 ansible_ssh_host=192.168.34.76 ansible_ssh_port=22 ansible_ssh_user='cloud-user' ansible_ssh_private_key_file='keys/test_cluster_private_key'
192.168.34.77 ansible_ssh_host=192.168.34.77 ansible_ssh_port=22 ansible_ssh_user='cloud-user' ansible_ssh_private_key_file='keys/test_cluster_private_key'
[cassandra_seed]
192.168.34.72
192.168.34.73
192.168.34.74
[cassandra]
192.168.34.75
192.168.34.76
192.168.34.77
$
For those not familiar with the structure of a static inventory file in Ansible, you can quite clearly see that it is an INI-style file that describes all of the parameters that are needed to connect to each of the target nodes, as well as any host groups that are used during the playbook run. In this example, we have three nodes that we are targeting, and for each node we have defined the IP address that Ansible should use when making an SSH connection to that host, the port and username that should be used for that SSH connection, and the private key file that should be used to authenticate that user. In addition, at the bottom of the example file shown above we have defined the cassandra_seed
and cassandra
host groups, which contain the set of nodes that will be targeted by our playbook run.
To deploy Cassandra to the six nodes in the cassandra_seed
and cassandra
host groups and configure those to work together as an cluster, we could run a command that looks something like this:
$ provision-cassandra.yml -i test-cluster-inventory -e "{ \
config_file: /dev/null, cassandra_data_dir: '/data'
}"
Alternatively, rather than passing in the value for the cassandra_data_dir
parameter and any other parameters needed during our playbook run as extra variables on the command-line, we can make use of the configuration file support that is built into our application deployment playbooks and construct a YAML file that looks something like this containing the configuration parameters that are being used for this deployment:
---
cassandra_data_dir: '/data'
we could then pass that that configuration file into the playbook run as an argument to the ansible-playbook
command; assuming the YAML file shown above was in the current working directory and was named test-config.yml
, the resulting command would look something like this:
$ provision-cassandra.yml -i test-cluster-inventory -e "{ \
config_file: 'test-config.yml' \
}"
Once that playbook run is complete, we can SSH into one of our nodes and run the nodetool
command to view the status of each of the nodes our cluster:
$ /opt/apache-cassandra/bin/nodetool status
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.34.72 99.6 KiB 32 29.2% 6685ca6a-bf8d-41d1-924b-187279a076a6 rack1
UN 192.168.34.73 65.86 KiB 32 31.0% 3e556a60-b440-4f2f-9047-aa70eb7b387c rack1
UN 192.168.34.74 65.86 KiB 32 30.6% 6746f789-71c0-40c9-9564-d705e9b43732 rack1
UN 192.168.34.75 99.61 KiB 32 41.6% 01bbe5f2-fac3-443f-97dd-7ce170ecde6c rack1
UN 192.168.34.76 99.6 KiB 32 37.1% 01216722-d2a6-4bfa-ae6e-b76481470828 rack1
UN 192.168.34.77 99.61 KiB 32 30.5% f4d3ac7d-0ec8-4b47-8811-c7fb09e8a04b rack1
$
As you can quite clearly see, the ansible-playbook
command shown aqbove deployed Cassandra to all six of our nodes and configured them as a single cluster. Of course, additional nodes could be added to this cluster in a subsequent playbook run; a topic that we will cover in the last scenario (below).