A general purpose framework for automating Cloudera Products


Cloudera Deploy

1. Automation For the Cloudera Data Platform

Cloudera Deploy is a toolset for deploying the Cloudera Data Platform (CDP). Its scope includes Public Cloud and Private Cloud products, Private Cloud Base clusters, and application setup, execution, and other post-deployment functions.

You can use Cloudera Deploy as your entrypoint for getting started with CDP. The toolset uses straightforward configuration definitions to instruct the automation functions, yet is extensible and highly configurable. The toolset can be a great foundation for custom entrypoints, CI/CD pipelines, and development environments.

2. Quickstart

2.1. Prerequisites

2.1.1. Install Docker

Cloudera-Deploy bundles nearly all the software dependencies you need into a convenient Docker Container, so first you will need to get the latest version of Docker Engine.

⚠️
Be sure to uninstall any earlier versions of Docker, i.e. docker, and install the latest version, i.e. docker-ce. See Install Docker Engine for further details.
💡
If you have not used Docker before, consider following their quick Tutorial to validate it is working and familiarise yourself with the interface

2.1.2. (Optional) Install Git

ℹ️
Git is required if you intend to clone the software for local editing. If you only intend to run the automation tools, you may skip this step.

There are excellent instructions for installing Git on all Operating Systems on the Git website

2.1.3. (Optional) Install AWS CLI

If you are going to be working with AWS, you will want the latest version of the AWS CLI.

ℹ️
The Quickstart image prepackages the AWS CLI, so it is optional to also install it locally

If this is the first time you are installing the AWS CLI, configure the program by providing your credentials, and test that your credentials work

aws configure
aws iam get-user

Visit the AWS CLI User Guide for further details regarding credential management.

2.1.4. (Optional) Install CDP CLI

Get the latest version of the CDP CLI.

ℹ️
The Quickstart image prepackages the CDP CLI, so it is optional to also install it locally.

If this is the first time you are installing the CDP CLI, you will need to configure the program by providing the right credentials, and should then test that your credentials work.

cdp configure
cdp iam get-user

Visit the CDP CLI User Guide for further details regarding credential management.

Ensure that you have a generated SSH keypair for your local profile. Visit the SSH Keygen How-To for details.

ℹ️
The Quickstart will generate an SSH keypair if none is provided.

Ensure that you have a properly configured SSH Agent. Visit the SSH Agent How-To for details.

2.2. Setup

2.2.1. Option 1: Download the Quickstart script

The quickstart.sh script will set up the Docker container with the software dependencies you need for deployment.

curl https://raw.githubusercontent.com/cloudera-labs/cloudera-deploy/main/quickstart.sh -o quickstart.sh

2.2.2. Option 2: Clone the repository

Clone the cloudera-deploy repository, which contains the quickstart.sh script.

git clone https://github.com/cloudera-labs/cloudera-deploy.git
cd cloudera-deploy
⚠️
You are advised not to modify any of the files in the project as a user of the software. The vast majority of changes are managed through configurations provided to these project files.

2.2.3. Confirm your Docker service

Check that Docker is running by listing your Docker containers:

docker ps -a

If it is not running, return to the Docker prerequisites to install, start, and test the service.
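The check above can be wrapped in a small guard for use in your own scripts. This is a minimal sketch, not part of cloudera-deploy: the function name docker_ready is hypothetical, and it relies on docker info exiting non-zero when the daemon cannot be reached.

```shell
#!/bin/sh
# docker_ready: hypothetical helper, not part of cloudera-deploy.
# Succeeds only if the docker CLI is installed and the daemon answers.
docker_ready() {
  if ! command -v docker >/dev/null 2>&1; then
    echo "docker CLI not found on PATH" >&2
    return 1
  fi
  if ! docker info >/dev/null 2>&1; then
    echo "docker CLI found, but the daemon is not reachable" >&2
    return 2
  fi
  echo "docker is running"
}

# Report the status without aborting the calling shell.
docker_ready || echo "Docker check failed; revisit the Docker prerequisites."
```

The two distinct return codes make it easier to tell a missing installation apart from a stopped service.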

2.2.4. Execute the Quickstart script

Run the quickstart.sh entrypoint script. This script will prepare and execute the Ansible Runner container.

chmod +x quickstart.sh
./quickstart.sh

2.2.5. Confirm the Quickstart environment

Confirm that you have the orange cldr (build)-(version) #> prompt.
This is your interactive Ansible Runner environment and provides builtin access to the relevant dependencies for CDP.

Do NOT run the example definition until you have made the changes below.

2.2.6. Setup your user profile

Modify your local cloudera-deploy user profile, located at ~/.config/cloudera-deploy/profiles/default in your home directory.

vim ~/.config/cloudera-deploy/profiles/default
Properties to change
  • Recommended

    • admin_password: Note the password requirements (see the profile template comments).

    • name_prefix: Note the namespace requirements (see the profile template comments).

    • infra_type: The valid values are aws, gcp, azure.

    • infra_region: Region is dependent on the value provided in infra_type.

  • Optional

⚠️
Ensure that the infra_region value is a valid region for the Cloud provider selected in the infra_type property.
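As an illustration of the recommended properties, the sketch below writes a sample to a scratch file so your real profile is left untouched. It assumes the profile uses YAML-style key: value lines (the project refers to a profile.yml template); every value shown is a placeholder, and the scratch path is hypothetical.

```shell
#!/bin/sh
# Write an illustrative profile to a scratch file; the real profile is not modified.
# All values are placeholders. See the profile template comments for the actual rules.
sample_profile="${TMPDIR:-/tmp}/cloudera-deploy-profile.sample"

cat > "$sample_profile" <<'EOF'
# Recommended properties named in the Quickstart
admin_password: "ChangeMe-Placeholder1"   # placeholder; must meet the template's password rules
name_prefix: "demo01"                     # placeholder; must meet the template's namespace rules
infra_type: "aws"                         # one of: aws, gcp, azure
infra_region: "us-east-1"                 # must be a valid region for the chosen infra_type
EOF

echo "Sample written to $sample_profile"
```

Compare the sample against your own profile rather than copying it over, since the template may contain further properties not shown here.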

2.3. Execution

2.3.1. Check your Credentials

Before running a deployment, check that the credentials available to the automation software work and belong to the expected accounts. A good practice is to compare the user and account IDs printed in the terminal with those shown in the browser UI.

CDP

If you are deploying CDP Public, check your credential is available in your profile

cdp iam get-user
💡
If you do not yet have a CDP Public credential, follow the Cloudera Documentation here
AWS

If you are using AWS cloud infrastructure, check your credential is available in your profile

aws iam get-user
Azure

If you are using Azure cloud infrastructure, check you are logged into your account and your credentials are available

az account list
💡
If you cannot list your Azure accounts, consider using az login to refresh your credential
GCP

If you are using GCP cloud infrastructure, check your service account credential is being picked up.

⚠️
You need a provisioning Service Account for GCP setup in your cloudera-deploy user profile 'gcloud_credential_file' entry. If you do not yet have a Provisioning Service Account you can follow this process in the CDP Documentation to generate one.
gcloud auth list
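The per-provider checks above can be combined into one helper. This is a sketch only: the function name check_credentials is hypothetical, it takes the infra_type value from your profile as its argument, and it always runs the CDP check first, which you may want to drop if you are not deploying CDP Public Cloud.

```shell
#!/bin/sh
# check_credentials: hypothetical wrapper around the checks listed above.
# $1 is the infra_type from your profile (aws, azure, or gcp).
check_credentials() {
  infra_type="$1"

  # CDP Public Cloud credential
  cdp iam get-user || return 1

  # Cloud provider credential for the selected infrastructure
  case "$infra_type" in
    aws)   aws iam get-user ;;
    azure) az account list ;;
    gcp)   gcloud auth list ;;
    *)     echo "unknown infra_type: $infra_type" >&2; return 2 ;;
  esac
}
```

At the cldr prompt you could then run, for example, check_credentials aws and compare the reported user and account IDs with the browser UI.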

2.3.2. Run the main playbook

Run the main playbook with the defaults and your configuration at the orange cldr prompt.

ℹ️
This will create a 'CDP sandbox', which is both a CDP Public Environment and a CDP Private Cloud Base cluster, using your default Cloud Infrastructure Provider credentials. Many other deployments are possible and explained elsewhere.
ansible-playbook /opt/cloudera-deploy/main.yml -e "definition_path=examples/sandbox" \
    -t run,default_cluster -vvv

2.3.3. View the Ansible execution logs

The logs are present at $HOME/.config/cloudera-deploy/log/latest-<currentdate>

tail -100f $HOME/.config/cloudera-deploy/log/latest-2021-05-08_150448
The total time to deploy varies from 90 to 150 minutes, depending on CDN, network connectivity, etc. Keep checking the logs; if there are no errors, the scripts are working in the background.
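Because the log name carries a timestamp, it can be convenient to locate the newest one automatically. The sketch below assumes every log follows the latest-<date> naming shown above, so that sorting by name also sorts by time; the function name latest_log is hypothetical.

```shell
#!/bin/sh
# latest_log: hypothetical helper that prints the newest latest-* log path.
# $1 = log directory (defaults to the documented location)
latest_log() {
  log_dir="${1:-$HOME/.config/cloudera-deploy/log}"
  newest=""
  # The glob expands in sorted order, so the last match has the latest timestamp.
  for candidate in "$log_dir"/latest-*; do
    # Skip the literal pattern that is left behind when nothing matches.
    [ -e "$candidate" ] && newest="$candidate"
  done
  if [ -z "$newest" ]; then
    echo "no latest-* log found in $log_dir" >&2
    return 1
  fi
  printf '%s\n' "$newest"
}
```

With this in place, tail -100f "$(latest_log)" follows the most recent run without typing the timestamp.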

2.4. Upgrade

Cloudera-Deploy is regularly updated by the maintainers with new features and fixes.
The quickstart.sh script will check for an updated Container image to use if there is currently no Container running.
You may use the following process to trigger this behavior.

⚠️
This will close any active cldr sessions you may have running.

Stop the cloudera-deploy Docker Container

docker stop cloudera-deploy
⚠️
If you have made local uncommitted changes to cloudera-deploy, you must resolve them before updating.

In the cloudera-deploy directory, pull the latest changes with git

git fetch --all
git pull

Finally, rerun the quickstart to download the latest image.

💡
You can stop the Docker Container and rerun the quickstart at any time to download the latest image
./quickstart.sh

3. Project Details

🔥
Don’t change the project configuration until you have run the quickstart a few times and are comfortable with it.
ℹ️
The pages below will be migrated to GitHub Pages shortly.

Cloudera Deploy is powered by Ansible and provides a standard configuration and execution model for CDP deployments and their applications. It can be run within a container, or directly on a host.

Specifically, Cloudera Deploy is an Ansible project that uses a set of playbooks, roles, and tags to construct a runlevel-like management experience for cloud and cluster deployments. It leverages several collections, both Cloudera and third-party.

3.1. Software Dependencies

Cloudera Deploy requires a number of host applications, services, and Python libraries for its execution. These dependencies are already packaged for ease-of-use in Cloudera Labs Ansible-Runner, another project within Cloudera Labs, and are made readily accessible through the quickstart.sh script.

Alternatively, and especially if you plan on running Cloudera Deploy in your own environment, you may install the dependencies yourself.

3.1.1. Collections and Roles

Cloudera Deploy relies directly on a number of Ansible collections:

And roles:

  • geerlingguy.postgresql

  • ansible-role-mysql

These collection dependencies can be found in the ansible.yml file in the cldr-runner project.

Cloudera Deploy does have a single dependency for its own execution, the community.crypto collection. To install all of these dependencies, you can run the following:

# Get the cldr-runner dependency file first
curl https://raw.githubusercontent.com/cloudera-labs/cldr-runner/main/payload/deps/ansible.yml \
    --output requirements.yml

# Install the collections (and their dependencies)
ansible-galaxy collection install -r requirements.yml

# Install the roles
ansible-galaxy role install -r requirements.yml

# Install the crypto collection
ansible-galaxy collection install community.crypto

3.1.2. Python and Clients

The supporting Python libraries and other clients can be installed using the various dependencies files in the cldr-runner project directly. You might find it easier to follow the installation instructions for cloudera.exe and cloudera.cluster, the two collections that drive this set of dependencies.

For the community.crypto collection dependency, you will need to ensure that the ssh-keygen executable is on your Ansible controller.

The dependencies cover the full range of the automation tooling, from infrastructure on public or private cloud to the relevant Cloudera platform assets. If you are only working with a limited part of the tooling, then you may not need the full list of dependencies. For example, if you are only working with AWS infrastructure, it is safe to install only those dependencies or to use the tagged cldr-runner version.

3.2. User Input Dependencies

Cloudera Deploy does require a small set of user-supplied information for a successful deployment. A minimum set of user inputs is defined in a profile file (see the profile.yml template for details). For example, the profile.yml should define your password for the Administrator account of the deployed services, and you should set a unique name_prefix to avoid clashing with other deployments.

The default location for profiles is ~/.config/cloudera-deploy/profiles/. Cloudera Deploy looks for the default file in this directory unless the Ansible runtime variable profile is set, e.g. -e profile=my_custom_profile. Creating additional profiles is simple, and you can use the profile.yml template as your starting point.
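One way to create an additional profile is to copy an existing one and edit the copy. The sketch below seeds the new profile from your default profile rather than from the profile.yml template, which is an assumption made for convenience; the function name new_profile is hypothetical.

```shell
#!/bin/sh
# new_profile: hypothetical helper that seeds a new profile from the default one.
# $1 = new profile name, $2 = profiles directory (defaults to the documented location)
new_profile() {
  name="$1"
  dir="${2:-$HOME/.config/cloudera-deploy/profiles}"

  [ -n "$name" ] || { echo "usage: new_profile NAME [DIR]" >&2; return 2; }
  [ -f "$dir/default" ] || { echo "no default profile in $dir" >&2; return 1; }
  # Refuse to overwrite an existing profile.
  [ ! -e "$dir/$name" ] || { echo "$dir/$name already exists" >&2; return 1; }

  cp "$dir/default" "$dir/$name"
  echo "$dir/$name"
}
```

After running new_profile my_custom_profile and editing the copy, select it at run time with -e profile=my_custom_profile.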

3.2.1. CDP Public Cloud

For CDP Public Cloud, you will need an Access Key and Secret set in your user profile. The tooling uses your default profile unless you instruct it otherwise. (See Configuring CDP client with the API access key.)

3.2.2. Cloud Providers

For Azure and AWS infrastructure, the process is similar, and these parameters may likewise be overridden.

For Google Cloud, we suggest you issue a credentials file, store it securely in your profile, and then provide the path to that file in profile.yml, as this works best with both CLI and Ansible Gcloud interactions.

We suggest you set your default infra_type in profile.yml to match your preferred default Public Cloud Infrastructure credentials.

3.2.3. CDP Private Cloud

For CDP Private Cloud you will need a valid Cloudera license file in order to download the software from the Cloudera repositories. We suggest this is stored in your user profile in ~/.cdp/ and set in the profile.yml config file.

If you are also using Public Cloud infrastructure to host your CDP Private Cloud clusters, then you will need those credentials as well.

3.3. Support Matrix

✓ - Supported

O - Supported in CDP, but not in Cloudera-Deploy

X - Not Supported in CDP

Experience AWS Azure GCP

Environment (Light Duty)

Environment (Medium Duty)

O

O

O

Data Lake (Light Duty)

Data Lake (Medium Duty)

O

O

O

Data Hub

Data Warehouse

O

X

Data Engineering

O

O

X

Data Flow

X

X

Machine Learning

X

Operational Database

X

4. SSH Host Key Checking

For CDP Private Cloud clusters and other direct inventory scenarios, you will need to manage SSH host key validation appropriate to your specific environment.

By default, the quickstart.sh script explicitly sets the ANSIBLE_HOST_KEY_CHECKING variable to False for ease-of-use with an introductory deployment. However, this setting is not recommended for any other deployment type. For all other deployment types, you should directly manage your SSH host key checking.

A common approach is to create your own "startup" script using the quickstart.sh as a template, and setting the appropriate Ansible SSH configuration variables.
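A custom startup script might re-enable host key checking and pre-populate known_hosts for the deployment's hosts. The sketch below is an assumption about what such a script could contain, not an excerpt from quickstart.sh: ANSIBLE_HOST_KEY_CHECKING is the Ansible variable named above, ssh-keyscan is the standard OpenSSH tool, and the function name trust_host is hypothetical.

```shell
#!/bin/sh
# Re-enable host key checking for Ansible's SSH connections.
export ANSIBLE_HOST_KEY_CHECKING=True

# trust_host: hypothetical helper that appends a host's public keys to known_hosts.
# $1 = hostname, $2 = known_hosts file (defaults to the user's file)
# Verify the collected fingerprints out of band before relying on them.
trust_host() {
  host="$1"
  known_hosts="${2:-$HOME/.ssh/known_hosts}"
  ssh-keyscan -H "$host" >> "$known_hosts"
}
```

Collecting keys with ssh-keyscan trusts the network at the moment of collection, so it suits controlled environments better than hosts reached over untrusted networks.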

In some scenarios, for example, a reused pool of dynamic hosts within a development Openstack environment, you might wish to manage this control from your host machine’s SSH config file. For example:

# ~/.ssh/config

# Disable host key checking only for your specific environment
Host *.your.development.domain
   StrictHostKeyChecking no

These settings will flow from your host to the Docker container’s environment if you use the quickstart.sh script.

5. Execution

Cloudera Deploy utilizes a single entrypoint playbook — main.yml — that examines the user-provided profile details, a deployment definition, and any optional Ansible tags and then runs the appropriate actions. At minimum, you execute a deployment like so:

ansible-playbook <location of cloudera-deploy>/main.yml \
  -e "definition_path=<absolute path, or path relative to main.yml, of the definition directory>"
ℹ️
The location defined by definition_path is relative to the location of the main.yml playbook, and can also be an absolute location.

5.1. Tags

Cloudera Deploy exposes a set of Ansible tags that allows fine-grained inclusion and exclusion of functions, in particular, a runlevel-like management process.

Table 1. Partial List of Available Execution Tags

  • infra: Infrastructure (cloud provider assets)

  • plat: Platform (CDP Public Cloud Datalakes). Assumes infra.

  • run: Runtime (CDP Public Cloud experiences, e.g. Cloudera Machine Learning (CML)). Assumes infra and plat.

  • full_cluster: CDP Private Cloud Base Clusters.

Current Tags: verify_inventory, verify, full_cluster, default_cluster, verify_definition, custom_repo, verify_parcels, database, security, kerberos, tls, ha, os, users, jdk, mysql_connector, oracle_connector, fetch_ca, cm, license, autotls, prereqs, restart_agents, heartbeat, mgmt, preload_parcels, kts, kms, restart_stale, teardown_ca, teardown_all, teardown_tls, teardown_cluster, infra, init, plat, run, validate

With these tags, you can set your deployment to a given "runlevel" state:

# Ensure only the infrastructure layer is available
ansible-playbook main.yml -e "definition_path=my_example" -t infra

or select or skip a level or function:

# Ensure the platform and runtimes are available, but skip any infrastructure
ansible-playbook main.yml -e "definition_path=my_example" -t run --skip-tags infra
⚠️
Setting a deployment to a lower runlevel, e.g. from run to infra, will tear down deployed components in the higher runlevels.

For further details on the various runlevel-like tags for CDP Public Cloud, see the Runlevel Guide in the cloudera.exe project.

5.2. Terraform Deployment Engine

Terraform can optionally be used to create the cloud infrastructure. This will attempt to create the cloud provider assets at the infra (network, storage and compute) and plat (IAM policies and roles) runlevels using Terraform resources. The Terraform-related parameters are listed in the table below.

Table 2. List of parameters used by Terraform deployment engine
  • infra_deployment_engine

    • Description: The engine (ansible or terraform) that will be used to create the infrastructure resources.
    • Default Value: ansible
    • Notes: Needs to be set to terraform for a Terraform deployment.

The parameters below are specified as keys in the terraform dictionary.

  • terraform.base_dir

    • Description: Top-level directory where all Terraform assets will be placed. Includes processed Jinja template files for Terraform, timestamped artefacts of Terraform files, and the workspace directory where terraform apply/destroy is run.
    • Default Value: ~/.config/cloudera-deploy/terraform

  • terraform.state_storage

    • Description: The type of backend storage to use for the Terraform state.
    • Default Value: local
    • Notes: Current options are local or remote_s3.

  • terraform.auto_remote_state

    • Description: Flag to allow Cloudera Deploy to automatically provision remote state resources as part of its initialization. This will also tear down these resources during cleanup.
    • Default Value: False

  • terraform.remote_state_bucket

    • Description: The name of the Terraform state storage bucket.
    • Notes: Required if using remote_s3 state storage. Value is derived from name_prefix if terraform.auto_remote_state is True.

  • terraform.remote_state_lock_table

    • Description: The name of the table to track locks of remote Terraform state.
    • Notes: Required if using remote_s3 state storage. Value is derived from name_prefix if terraform.auto_remote_state is True.
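As a sketch of how these parameters fit together, the snippet below writes a YAML fragment that selects the Terraform engine with remote state. The key names come from the table above; where the fragment belongs (for example, in a definition or as extra variables) is not stated in this document, so the scratch file path is hypothetical and placement is left as an assumption for you to confirm.

```shell
#!/bin/sh
# Write an illustrative variables fragment to a scratch file (hypothetical path).
vars_file="${TMPDIR:-/tmp}/terraform-engine.sample.yml"

cat > "$vars_file" <<'EOF'
# Switch the infrastructure engine from the default (ansible) to Terraform
infra_deployment_engine: terraform

terraform:
  # Keep Terraform state in the remote_s3 backend instead of the default (local)
  state_storage: remote_s3
  # Let Cloudera Deploy create, and later tear down, the remote state resources
  auto_remote_state: True
  # remote_state_bucket and remote_state_lock_table are left unset here because
  # the table states they are derived from name_prefix when auto_remote_state is True
EOF

echo "Sample written to $vars_file"
```

If you manage the state bucket and lock table yourself, leave auto_remote_state at its default and set both names explicitly instead.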

6. Definitions

Cloudera Deploy uses a set of configuration files within a directory to define and coordinate a deployment. This directory also stores any artifacts created during the deployment, such as Ansible inventory files, CDP environment readouts, etc.

The main.yml entrypoint playbook expects the runtime variable definition_path which should point at the absolute or relative (to the playbook) directory hosting these configuration files.

Within the directory, you must supply the following files:

  • definition.yml

  • application.yml

Optionally, if you are deploying a CDP Private Cloud cluster or need to set up ad hoc IaaS infrastructure, you can supply the following:

  • inventory_static.ini

  • inventory_template.ini

The definition directory can host any other file or asset, such as data files, additional configuration details, additional playbooks. However, Cloudera Deploy will not operate unless the definition.yml and application.yml files are present.
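A quick pre-flight check for the two required files can save a failed run. This is a minimal sketch; the function name check_definition_dir is hypothetical and it only tests for the presence of the files, not their contents.

```shell
#!/bin/sh
# check_definition_dir: hypothetical pre-flight check for a definition directory.
# $1 = path to the definition directory
# Returns 0 when both required files exist, 1 otherwise.
check_definition_dir() {
  dir="$1"
  missing=0
  for required in definition.yml application.yml; do
    if [ ! -f "$dir/$required" ]; then
      echo "missing required file: $dir/$required" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, check_definition_dir examples/sandbox could be run from the directory containing main.yml before invoking the playbook.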

6.1. definition.yml

The required definition.yml file contains top-level configuration keys that define and direct the deployment.

Table 3. Top-Level Configuration Keys

  • infra: Hosting infrastructure to manage

  • env: CDP Public Cloud Environment deployment (on the infrastructure)

  • clusters: CDP Private Cloud Cluster deployment (on the infrastructure)

  • mgmt

  • hosts

Within the top-level keys, you may override the defaults appropriate to that section.

You may also add other top-level configuration keys if your automation requires it, e.g. if your application.yml playbook needs its own configuration details.

More detailed documentation of all the options is beyond the scope of this introductory readme; further documentation is forthcoming.

6.2. application.yml

The required application.yml file is not a configuration file; it is an Ansible playbook. At minimum, this playbook requires a single Ansible play; a basic no-op task works well if you wish to take no additional actions beyond the core deployment.

For more sophisticated post-deployment activities, you can expand this playbook as much as needed. For example, the playbook can interact with hosts and inventory, execute computing jobs on deployment environments, and include additional playbooks and configuration files.

ℹ️
This file is a standard Ansible playbook, and when it is executed (via import_playbook) by the main.yml entrypoint, the working directory of the Ansible executable is changed to the directory of the application.yml playbook.
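A minimal application.yml with a single harmless play might look like the sketch below, written here into a scratch definition directory. The directory path, the play targeting localhost, and the use of a debug message as the no-op task are all assumptions for illustration; any valid single play should serve the same purpose.

```shell
#!/bin/sh
# Create a scratch definition directory (hypothetical path) with a minimal playbook.
definition_dir="${TMPDIR:-/tmp}/my_example_definition"
mkdir -p "$definition_dir"

cat > "$definition_dir/application.yml" <<'EOF'
---
- name: No additional application steps
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Nothing to do after the core deployment
      ansible.builtin.debug:
        msg: "No post-deployment actions defined"
EOF

echo "Wrote $definition_dir/application.yml"
```

A definition.yml must sit alongside this file before the directory can be used as a definition_path.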

6.3. inventory_static.ini

You may also include an inventory_static.ini file that describes your static Ansible inventory. This file will be automatically loaded and added to the Ansible inventory. Note that you can also use the standard Ansible -i switch to include other static inventory.

6.4. inventory_template.ini

If a definition includes an inventory_template.ini file, which describes a set of dynamic host inventory, Cloudera Deploy will provision those hosts as infrastructure for the deployment, typically for a CDP Private Cloud cluster.

ℹ️
This currently only works on AWS.

7. Getting Involved

Contribution instructions are coming soon!

Copyright 2021, Cloudera, Inc.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
