# hashicorp-services/consul-in-prod = Reference Implementation Book
This book gives customer-facing implementers hands-on, end-to-end instructions for automating the creation of secure, highly available production implementations of HashiCorp Consul with Vault, as commonly deployed within Global 2000 enterprises.
The most common use of Consul is as a "Global Network Mesh", a superset of the industry concept of "Service Mesh". Unlike alternatives that operate only in Kubernetes or only in AWS, Consul provides multi-region reliability, multi-cloud flexibility, and multi-platform adaptability.
This document focuses on how to build those capabilities.
Contents:
- What's inside - Table of Contents
- Construction Stages (part of Adoption Plan)
- Proof of Production Viability (part of Reliability Plan)
Additional pages this summary page links to, alphabetically:
- Client Configuration
- CI/CD DevSecOps
- Impact Assessment
- Observability (Log aggregation and Dashboard analytics)
- Project Management to ensure inclusion during fast work
- Server Configuration
- Solution Design
- BestPractices.md identified from interviews of each persona, based on the Well-Architected Framework
- HashiCorp Presales Solution Engineering (Kyle Rarey, Ram Ramhariram, Nathan Pearce)
- SE SME (Ancil McBarnett)
- HashiCorp PS (Professional Services)
- HashiCorp Implementation Services (Austin Workman, )
- HashiCorp CS SEs (Customer Service Engineers) (Josh Wolfer)
- Consultants within HashiCorp services partners (Gabe)
- HashiCorp Field CTOs (Jake Lundberg)
This is being collaboratively developed and maintained by the above plus these stakeholders: TODO: Get the org names correct!
- Domain Architecture (Wilson Mar, Frank Hane)
- PSE (Iman, John Boero, Jim Sullivan)
- Education (Tu Nguyen, Daniele Carcasole)
- Operations Experience
- (Tony Pulickal)
- (Matt Peters)
- Reference Architecture (Chloe Cota)
- SE (segment leaders in US, EMEA, APJ)
- Field (Thomas Kula)
- CSA
- CSM
- IS
- Consul Marketing PMM (Van Phan)
- Consul Product Management (Usha Kodali, Abhishek Tiwari)
- Dev Evangelists (Rosemary Wang)
Governance:
- Hari
- Joe Weber
NOTE: This document presents best practices and tools which individual practitioners are free to adjust as they see fit for each situation.
During a Consul Accelerator Program (CAP) engagement, this document is tailored to the customer.
QUESTION: Who should be included?
QUESTION: Who grants access to services partners to this org/repo on GitBook?
Consul is available in several editions:
- Free Open-Source (at https://github.com/hashicorp/consul)
- Licensed Enterprise (installed, configured, and managed by enterprise customers)
- Licensed HCP cloud Consul (using HVN setup within the customer's app infrastructure)
Because most enterprises want support contracts, this document focuses on enterprise use in production. It does not cover setting up an individual stand-alone Consul cluster for learning purposes.
The HashiCorp cloud edition of Consul (HCP Consul), peered with Consul Enterprise and provisioned with Terraform, is used for "Cloud to Ground" peering connections to on-prem servers maintained by enterprise customers.
The baseline Solution Design here assumes, for reliability, use of 5 nodes per Consul datacenter across 3 Availability Zones (each a separate VPC) within each region.
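The reasoning behind a 5-node, 3-zone baseline can be checked with simple majority arithmetic. The sketch below is illustrative only: the 2/2/1 placement of servers across zones is an assumption (the text above specifies only 5 nodes across 3 Availability Zones), and the function names are hypothetical.

```python
from typing import Sequence


def raft_quorum(voting_servers: int) -> int:
    """Smallest majority of the voting servers."""
    return voting_servers // 2 + 1


def failure_tolerance(voting_servers: int) -> int:
    """How many voting servers can be lost while a majority remains."""
    return voting_servers - raft_quorum(voting_servers)


def survives_zone_loss(servers_per_zone: Sequence[int]) -> bool:
    """True if losing the most populated zone still leaves a majority."""
    total = sum(servers_per_zone)
    return total - max(servers_per_zone) >= raft_quorum(total)


# Assumed placement: 5 servers spread 2/2/1 over 3 Availability Zones.
layout = [2, 2, 1]
print("quorum:", raft_quorum(sum(layout)))
print("tolerated server failures:", failure_tolerance(sum(layout)))
print("survives loss of one zone:", survives_zone_loss(layout))
```

Under this assumed placement, any single zone can be lost without losing the majority. A 3/1/1 placement would not have that property, which is why placement matters as much as the node count.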
TODO: Add performance nodes to this Consul single-datacenter/region Reference Architecture.
To ensure production-level reliability at Enterprise scale, each implementation here also addresses two regions peered together.
See this HA decision tree from Microsoft:
https://docs.microsoft.com/en-us/azure/architecture/example-scenario/infrastructure/media/ha-decision-tree.png
An app is needed for Consul to manage. This document makes use of the HashiCups sample app maintained by HashiCorp.
Although Consul works with multiple platform technologies, the sample used here is a Linux-based e-commerce application (HashiCups?) running in Kubernetes, with a server node for each of these APIs:
- front-end web server
- product
- shipment
- payment (external)
(Not included are Redis cache, Elasticsearch (to parse logs), mail/SMS, ratings, observability, analytics, etc.)
TODO: Create a diagram to add in the Azure Reference Architecture diagrams with HA/DR
Alternative HA-capable apps:
https://docs.microsoft.com/en-us/azure/architecture/example-scenario/infrastructure/wordpress
A key capability of Consul is that, as each app service is instantiated, Consul detects it, adds it to its Service Registry, and then securely routes network traffic to it, even across disparate platforms (via a Consul API Gateway).
Consul provides the "glue" to services across the enterprise.
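As a rough illustration of what lands in the Service Registry, the sketch below builds the kind of JSON body that a local Consul agent accepts on its `/v1/agent/service/register` HTTP endpoint (sent with an HTTP PUT). The service name comes from the sample app list above; the port, instance ID, and `/health` path are assumptions chosen for illustration, and nothing is sent over the network here.

```python
import json


def build_registration(name: str, port: int, health_path: str = "/health") -> dict:
    """Build a service registration body with one HTTP health check."""
    return {
        "ID": f"{name}-1",  # hypothetical instance ID
        "Name": name,
        "Port": port,
        "Check": {
            # Assumed health endpoint exposed by the service itself.
            "HTTP": f"http://localhost:{port}{health_path}",
            "Interval": "10s",
        },
    }


# "product" is one of the sample app APIs; port 9090 is an assumption.
payload = build_registration("product", 9090)
print(json.dumps(payload, indent=2))
```

The health check is what later lets Consul stop routing traffic to an instance that has become unhealthy.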
Within enterprises, each app exists in a sea of other apps and systems.
In each app server node, a Consul sidecar enables a Consul Global Network Mesh which directs L4 (OSI Layer 4, transport) traffic across multiple clouds operating different platform technologies:
- A. Among app nodes within the same app cluster
- B. Database (MySQL, PostgreSQL, Oracle) outside Kubernetes
- C. AWS EC2 image running in AWS
- D. AWS ECS (Elastic Container Service) VID
- E. AWS EKS (Elastic Kubernetes Service) with Service Mesh
- F. VMware registry image of hypervisor communicating with Kubernetes
- G. Pivotal Cloud Foundry
- H. Red Hat OpenShift
- I. Serverless (AWS Lambda, Azure & GCP Functions) running within clouds
- The Consul Docker container is retrieved into Kubernetes.
Thus, Consul "future proofs" how enterprises securely manage networking between services.
This document defines the features of Consul and Vault implemented:
- Centralized identity-based authentication for users and service accounts (with SSO and MFA using Okta OIDC IdP and other Authentication Methods)
- Centralized secrets management (using Vault)
- Encryption of app data in transit and at rest (using Vault)
- Segregation of app data into different data Namespaces to reduce access to data in case of breach (which we assume will occur)
- Admin partitions in a centralized service registry in the Consul Key/Value store (this is the core feature desired by most enterprises)
- Segmentation of network traffic to restrict network access in case of breach (which we assume will occur)
- Layer 7 Traffic Management for Canary testing, A/B tests, blue/green deploys, and soft multi-tenancy (instead of "East/West" Load Balancers among microservices)
- Use of Access Control List (ACL) tokens to enforce least-privilege access
- Read replicas to ensure performance and reliability as systems scale
- Automated backups to ensure quick and reliable recovery in case of disaster at app, cloud Availability Zone, and cloud region levels (for MTTR to meet RPO and RTO standards)
- Autopilot Enterprise Redundancy Zones to improve resiliency and scaling by adding “non-voting” servers which will be promoted to voting status in case of voting server failure
- Automated upgrades of versions
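The Redundancy Zones behavior listed above (a non-voting server is promoted when a voting server fails) can be pictured with a toy model. This is a simplified sketch, not Autopilot's actual algorithm: the `Server` class, the server names, and the one-voter-plus-one-standby-per-zone layout are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Server:
    name: str
    zone: str
    voter: bool
    healthy: bool = True


def promote_on_failure(servers: List[Server]) -> List[str]:
    """In each zone with no healthy voter, promote one healthy non-voter."""
    promoted = []
    for zone in sorted({s.zone for s in servers}):
        members = [s for s in servers if s.zone == zone]
        if any(s.voter and s.healthy for s in members):
            continue  # zone still has a working voter
        for s in members:
            if s.healthy and not s.voter:
                s.voter = True
                promoted.append(s.name)
                break
    return promoted


# Hypothetical layout: one voter and one non-voting standby per zone.
servers = [
    Server("consul-az1-a", "az1", voter=True),
    Server("consul-az1-b", "az1", voter=False),
    Server("consul-az2-a", "az2", voter=True),
    Server("consul-az2-b", "az2", voter=False),
    Server("consul-az3-a", "az3", voter=True),
    Server("consul-az3-b", "az3", voter=False),
]
servers[0].healthy = False  # simulate failure of the az1 voter
promoted = promote_on_failure(servers)
healthy_voters = [s.name for s in servers if s.voter and s.healthy]
print("promoted:", promoted)
print("healthy voters:", healthy_voters)
```

In this model the number of healthy voters returns to three after the promotion, so the voting set stays the same size even though one server has failed.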
The objective of implementing Consul and Vault features is to enable optimal achievement of what is defined in the Zero-Trust Maturity Model, SOC2, ISO 27000, and other security frameworks along with the Well-Architected Framework.
Manual steps to create each implementation (explained like the Imagina GitBook) are logically organized into these sequential stages (using automated means where applicable):
TODO: Create a flowchart such as this
- Project Management - tools and processes to ensure inclusion during organized and fast work
- Questions for interviewing each persona, based on the comprehensive Well-Architected Framework
- Design solution settings - decide on values of variables for datacenter and region, and on abbreviations (such as Azure's)
- Define least-privilege Roles - the actions allowed/disallowed for each persona (in role files)
- Establish Auth Methods - Okta, etc. for SSO and MFA by each user
- Establish cloud accounts - with special Administrator access used during setup and regular accounts. This being Enterprise, we assume use of multiple cloud accounts.
- Establish cloud admin account for managing backup data (assuming breach of other accounts)
- Laptop setup - on each builder laptop: Xcode, Homebrew, wget, tree, Jinja, VSCode, Git, GPG, Vault, Consul, Terraform, Packer, Docker, Docker Compose, etc.
- GitHub setup - with SSH and GPG certificates
- Clone GitHub template repos - each of the Reference Implementation components
- Establish Observability systems (Log/SIEM, Dashboards, etc.) used during troubleshooting and managerial reviews
- Establish the App/system and databases described above
- Run security scans - on Terraform while on laptop (secret detection, TFSec, etc.)
- Run GitHub Actions or CircleCI CI/CD, which invoke bootstrapping Bash shell scripts and Terraform which load variables, create folders, bootstrap, configure to reboot automatically, etc.
- Obtain Enterprise licenses - from a HashiCorp Customer Success Solution Engineer
- Establish Vault - install and configure
- Establish Consul per Features listed above (with segregated namespaces, segmented networks, read replicas, automated backup, etc.)
- Define Intentions and ACLs - using Consul to manage the sample application
- Estimate costs - (using Terracost)
- Maintain system
The list above provides a way to measure progress of the entire project.
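For the "Define Intentions and ACLs" stage, it can help to reason about which service-to-service calls should be permitted before writing any configuration. The sketch below is a simplified model of intention evaluation, not Consul's implementation: it assumes a deny-by-default posture, treats an exact destination match as more specific than an exact source match, and uses service names loosely based on the sample app (`frontend`, `product`, `payment`).

```python
from typing import List, NamedTuple


class Intention(NamedTuple):
    source: str
    destination: str
    action: str  # "allow" or "deny"


def is_allowed(intentions: List[Intention], source: str, destination: str) -> bool:
    """Apply the most specific matching intention; deny when nothing matches."""
    matches = [
        i for i in intentions
        if i.source in (source, "*") and i.destination in (destination, "*")
    ]
    if not matches:
        return False  # assumed default: deny
    # Exact names outrank wildcards; destination is compared first.
    best = max(matches, key=lambda i: (i.destination != "*", i.source != "*"))
    return best.action == "allow"


# Hypothetical rule set for the sample application.
rules = [
    Intention("*", "*", "deny"),
    Intention("frontend", "product", "allow"),
    Intention("product", "payment", "allow"),
]
for src, dst in [("frontend", "product"), ("frontend", "payment"), ("product", "payment")]:
    print(f"{src} -> {dst}: {'allow' if is_allowed(rules, src, dst) else 'deny'}")
```

Writing the intended allow/deny matrix down in this form gives the team a reviewable artifact to compare against the intentions actually configured in Consul.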
There are different construction steps for each cloud and platform (defined below).
This presents procedures and automation for creating Consul within each cloud:
Scripts to install assets for Instruqt labs may be used in production scripts.
- AWS is the most popular cloud.
  - https://github.com/hashicorp-services/accelerator-aws-consul/ (internal) by Implementation Services (Kyle Rarey, Austin Workman) contains scripts used on customer sites
  - https://github.com/hashicorp/terraform-aws-consul-starter by the Operations Experience team (Omar Khawaja and Sara Chandler) is deprecated.
- Google (GCP) - HashiCorp's hands-on Instruqt labs run on GCP.
  - https://github.com/hashicorp/field-workshops-consul by Thomas Kula (PreSales Solutions Engineering) has slides for aws, azure, gcp, multi-cloud. Has an instructor guide to Instruqt tracks.
  - https://github.com/hashicorp/learn-instruqt contains source files for interactive scenarios at https://learn.hashicorp.com
  - https://github.com/hashicorp-services/enablement-consul-instruqt
  - https://github.com/hashicorp-services/enablement-vault-instruqt
  - https://github.com/hashicorp-services/accelerator-gcp-consul/ doesn't exist yet
- Azure
  - https://github.com/hashicorp-services/accelerator-azure-consul/ doesn't exist yet
  - https://github.com/hashicorp/terraform-azure-consul-ent-starter by the Operations Experience team (Omar Khawaja and Sara Chandler) is deprecated.

NOTE: We aim to structure our implementation scripts to make it easier to customize across different clouds.
- Inside AWS created using Terraform: https://github.com/hashicorp-services/accelerator-aws-consul
- On-prem using Ansible Playbooks: https://github.com/hashicorp-services/ansible-role-consul/tree/aworkman_testing
References:
Slide decks from https://hashicorp.github.io/workshops/ and https://github.com/hashicorp/field-workshops-consul
- Intro to Consul Enterprise - A two-hour introductory workshop. (Sales/Consul Team or FM Team)
- Life of a Developer with Consul Enterprise - A two-hour container-based progressive application delivery workshop. (Sales/Consul Team)
- Network Infrastructure Automation with Consul Enterprise - A two-hour network acceleration workshop with Terraform and Consul-Terraform-Sync. (Sales/Consul Team)
- Multi-Cloud Service Networking with Consul Enterprise - An operational half-day multi-product Zero Trust workshop across AWS, Azure, and GCP. (Sales/Consul Team)
Each implementation has an edition/variation for each technical platform:
- macOS is commonly used on laptop clients used by developers and administrators.
- Linux (Ubuntu, Red Hat)
Our objective is that, after initial install and configuration, minimal manual intervention be required to keep Consul running.
To save time, enable collaboration, ensure repeatability, and avoid mistakes, the Consul features above are instantiated using modern DevSecOps principles:
- Self-serve Shift-left for developer efficiency (TFSec on desktops)
- Version control code with code reviews (in GitHub)
- Automated CI/CD (GitHub Actions) for speed and comprehensive security scanning. TODO: Adapt example from Azure:
- Infrastructure-as-Code HashiCorp Terraform Enterprise Workspaces invoked by Bash shell scripts.
- Dockerized (Docker Compose) from a Docker Registry
- Automated policy enforcement to replace long waits for manual approvals
- Helm charts
- Azure/Terraform-provider-azapi VIDEO
We also provide procedures and automation to prove that production-grade mechanisms can actually respond effectively to various stresses (planned and unplanned outages):
- ACLs block access by those not authorized
- As each new service comes online, Consul "discovers" it
- Each node auto-starts and rejoins its cluster when appropriate
- When a service becomes unhealthy, Consul no longer routes traffic to it
- When ACLs and Intentions are changed, access is immediately restricted or allowed
- Data associated with each change is propagated (via Serf Gossip protocol) among nodes
- Each change that occurs in Consul results in notification of network management systems (Palo Alto via NIA CTS)
- When a Follower node is no longer functional, others continue working
- When the Leader node is no longer functional, the Raft protocol establishes a new Leader
- Additional cloud resources are added or removed as demand rises or drops significantly
- Logs can be filtered for troubleshooting (using Datadog and other observability tools)
- Dashboards (using Grafana) display analytics about key trends
- Alerts are issued when trouble is recognized (due to injection of troubling events)
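One of the proofs above, that traffic stops flowing to an unhealthy service, can be phrased as a small check over health data. The sketch below is a hedged illustration: the instance records are hand-written sample data (not output captured from Consul), and the filter mimics the effect of asking Consul's health endpoint for passing instances only.

```python
from typing import Dict, List


def routable_instances(instances: List[Dict]) -> List[str]:
    """Return addresses of instances whose health checks are all passing."""
    return [
        inst["address"]
        for inst in instances
        if all(status == "passing" for status in inst["checks"])
    ]


# Hypothetical snapshot of two instances of the "product" service.
before = [
    {"address": "10.0.1.10:9090", "checks": ["passing", "passing"]},
    {"address": "10.0.2.10:9090", "checks": ["passing", "passing"]},
]
# Same instances after a fault is injected into the second one.
after = [
    {"address": "10.0.1.10:9090", "checks": ["passing", "passing"]},
    {"address": "10.0.2.10:9090", "checks": ["passing", "critical"]},
]
print("routable before fault:", routable_instances(before))
print("routable after fault:", routable_instances(after))
```

In a real proof run, the two snapshots would come from querying Consul before and after the injected fault, and the test would pass only if the faulted instance drops out of the routable set.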