diff --git a/README.md b/README.md index 957fa6407d..a84157db2c 100644 --- a/README.md +++ b/README.md @@ -54,6 +54,8 @@ Choose one of the following tutorials: * [Deploy TiDB by launching an AWS EKS cluster](./docs/aws-eks-tutorial.md) +* [Deploy TiDB Operator and TiDB Cluster on Alibaba Cloud Kubernetes](./deploy/alicloud/README.md) + * [Deploy TiDB in the minikube cluster](./docs/minikube-tutorial.md) ## User guide diff --git a/deploy/alicloud/.gitignore b/deploy/alicloud/.gitignore new file mode 100644 index 0000000000..50525a24e0 --- /dev/null +++ b/deploy/alicloud/.gitignore @@ -0,0 +1,6 @@ +.terraform/ +credentials/ +terraform.tfstate +terraform.tfstate.backup +.terraform.tfstate.lock.info +rendered/ diff --git a/deploy/alicloud/README-CN.md b/deploy/alicloud/README-CN.md new file mode 100644 index 0000000000..5b72583469 --- /dev/null +++ b/deploy/alicloud/README-CN.md @@ -0,0 +1,103 @@ +# 在阿里云上部署 TiDB Operator 和 TiDB 集群 + +## 环境需求 + +- [aliyun-cli](https://github.com/aliyun/aliyun-cli) >= 3.0.15 并且[配置 aliyun-cli](https://www.alibabacloud.com/help/doc-detail/90766.htm?spm=a2c63.l28256.a3.4.7b52a893EFVglq) +- [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/#install-kubectl) >= 1.12 +- [helm](https://github.com/helm/helm/blob/master/docs/install.md#installing-the-helm-client) >= 2.9.1 +- [jq](https://stedolan.github.io/jq/download/) >= 1.6 +- [terraform](https://learn.hashicorp.com/terraform/getting-started/install.html) 0.11.* + +> 你可以使用阿里云的 [云命令行](https://shell.aliyun.com) 服务来进行操作,云命令行中已经预装并配置好了所有工具。 + +## 概览 + +默认配置下,我们会创建: + +- 一个新的 VPC; +- 一台 ECS 实例作为堡垒机; +- 一个托管版 ACK(阿里云 Kubernetes)集群以及一系列 worker 节点: + - 属于一个伸缩组的 2 台 ECS 实例(1核1G), 托管版 Kubernetes 的默认伸缩组中必须至少有两台实例, 用于承载整个的系统服务, 比如 CoreDNS + - 属于一个伸缩组的 3 台 `ecs.i2.xlarge` 实例, 用于部署 PD + - 属于一个伸缩组的 3 台 `ecs.i2.2xlarge` 实例, 用于部署 TiKV + - 属于一个伸缩组的 2 台 ECS 实例(16核32G)用于部署 TiDB + - 属于一个伸缩组的 1 台 ECS 实例(4核8G)用于部署监控组件 + - 一块 500GB 的云盘用作监控数据存储 + +除了默认伸缩组之外的其它所有实例都是跨可用区部署的。而伸缩组(Auto-scaling Group)能够保证集群的健康实例数等于期望数值,因此,当发生节点故障甚至可用区故障时,伸缩组能够自动为我们创建新实例来确保服务可用性。 + +## 安装 + +设置目标 Region 和阿里云密钥(也可以在运行 `terraform` 命令时根据命令提示输入) +```shell +export TF_VAR_ALICLOUD_REGION= +export TF_VAR_ALICLOUD_ACCESS_KEY= +export TF_VAR_ALICLOUD_SECRET_KEY= +``` + +使用 Terraform 进行安装: + +```shell +$ git clone https://github.com/pingcap/tidb-operator +$ cd tidb-operator/deploy/alicloud +$ terraform init +$ terraform apply +``` + +整个安装过程大约需要 5 至 10 分钟,安装完成后会输出集群的关键信息(想要重新查看这些信息,可以运行 `terraform output`),接下来可以用 `kubectl` 或 `helm` 对集群进行操作: + +```shell +$ export KUBECONFIG=$PWD/credentials/kubeconfig_ +$ kubectl version +$ helm ls +``` + +并通过堡垒机连接 TiDB 集群进行测试: + +```shell +$ ssh -i credentials/bastion-key.pem root@ +$ mysql -h -P -u root +``` + +## 升级 TiDB 集群 + +设置 `variables.tf` 中的 `tidb_version` 参数,运行 `terraform apply` 即可完成升级。 + +## TiDB 集群水平伸缩 + +设计 `variables.tf` 中的 `tikv_count` 和 `tidb_count`,运行 `terraform apply` 即可完成 TiDB 集群的水平伸缩。 + +## 销毁集群 + +```shell +$ terraform destroy +``` + +> 注意:监控组件挂载的云盘需要手动删除。 + +## 监控 + +访问 `` 就可以查看相关的 Grafana 看板。 + +> 出于安全考虑,假如你已经或将要给 VPC 配置 VPN,强烈推荐将 `monitor_slb_network_type` 设置为 `intranet` 来禁止监控服务的公网访问。 + +## 自定义 + +默认配置下,Terraform 脚本会创建一个新的 VPC,假如要使用现有的 VPC,可以在 `variable.tf` 中设置 `vpc_id`。注意,当使用现有 VPC 时,没有设置 vswitch 的可用区将不会部署 kubernetes 节点。 + +出于安全考虑,TiDB 服务的 SLB 只对内网暴露,因此默认配置下还会创建一台堡垒机用于运维操作。堡垒机上还会安装 mysql-cli 和 sysbench 以便于使用和测试。假如不需要堡垒机,可以设置 `variables.tf` 中的 `create_bastion` 参数来关闭。 + +实例的规格可以通过两种方式进行定义: + +1. 通过声明实例规格名; +2. 
通过声明实例的配置,比如 CPU 核数和内存大小。
+
+由于阿里云在不同地域会提供不同的规格类型,并且部分规格有售罄的情况,我们推荐使用第二种办法来定义更通用的实例规格。你可以在 `variables.tf` 中找到相关的配置项。
+
+特殊地,由于 PD 和 TiKV 节点强需求本地 SSD 存储,脚本中不允许直接声明 PD 和 TiKV 的规格名,你可以通过设置 `*_instance_type_family` 来选择 PD 或 TiKV 的规格族(只能在三个拥有本地 SSD 的规格族中选择),再通过内存大小来筛选符合需求的型号。
+
+更多自定义配置相关的内容,请直接参考项目中的 `variables.tf` 文件。
+
+## 限制
+
+目前,pod cidr, service cidr 和节点型号等配置在集群创建后均无法修改。
diff --git a/deploy/alicloud/README.md b/deploy/alicloud/README.md
new file mode 100644
index 0000000000..f23da0e85c
--- /dev/null
+++ b/deploy/alicloud/README.md
@@ -0,0 +1,109 @@
+# Deploy TiDB Operator and TiDB Cluster on Alibaba Cloud Kubernetes
+
+[中文](README-CN.md)
+
+## Requirements
+
+- [aliyun-cli](https://github.com/aliyun/aliyun-cli) >= 3.0.15 and [configure aliyun-cli](https://www.alibabacloud.com/help/doc-detail/90766.htm?spm=a2c63.l28256.a3.4.7b52a893EFVglq)
+- [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/#install-kubectl) >= 1.12
+- [helm](https://github.com/helm/helm/blob/master/docs/install.md#installing-the-helm-client) >= 2.9.1
+- [jq](https://stedolan.github.io/jq/download/) >= 1.6
+- [terraform](https://learn.hashicorp.com/terraform/getting-started/install.html) 0.11.*
+
+> You can use the Alibaba [Cloud Shell](https://shell.aliyun.com) service, which has all the tools pre-installed and properly configured.
+
+## Overview
+
+The default setup will create:
+
+- A new VPC
+- An ECS instance as the bastion machine
+- A managed ACK (Alibaba Cloud Kubernetes) cluster with the following ECS instances as worker nodes:
+  - An auto-scaling group of 2 * instances (1c1g) as ACK mandatory workers for system services such as CoreDNS
+  - An auto-scaling group of 3 * `ecs.i2.xlarge` instances for PD
+  - An auto-scaling group of 3 * `ecs.i2.2xlarge` instances for TiKV
+  - An auto-scaling group of 2 * instances (16c32g) for TiDB
+  - An auto-scaling group of 1 * instance (4c8g) for monitoring components
+
+In addition, the monitoring node mounts a 500GB cloud disk as its data volume. All instances except the ACK mandatory workers span multiple availability zones to provide cross-AZ high availability.
+
+Each auto-scaling group maintains the desired number of healthy instances, so the cluster can automatically recover from node failure or even availability zone failure.
+
+## Setup
+
+Configure the target region and credentials (you can also enter them interactively when `terraform` prompts for them):
+```shell
+export TF_VAR_ALICLOUD_REGION=<your_region>
+export TF_VAR_ALICLOUD_ACCESS_KEY=<your_access_key>
+export TF_VAR_ALICLOUD_SECRET_KEY=<your_secret_key>
+```
+
+Apply the stack:
+
+```shell
+$ git clone https://github.com/pingcap/tidb-operator
+$ cd tidb-operator/deploy/alicloud
+$ terraform init
+$ terraform apply
+```
+
+`terraform apply` takes 5 to 10 minutes to create the whole stack. Once it completes, you can interact with the ACK cluster using `kubectl` and `helm`:
+```shell
+$ export KUBECONFIG=$PWD/credentials/kubeconfig_<cluster_name>
+$ kubectl version
+$ helm ls
+```
+
+Then you can connect to the TiDB cluster via the bastion instance:
+
+```shell
+$ ssh -i credentials/bastion-key.pem root@<bastion_ip>
+$ mysql -h <tidb_slb_ip> -P <tidb_port> -u root
+```
+
+## Monitoring
+
+Visit `<monitor_endpoint>` to view the Grafana dashboards.
+
+> It is strongly recommended to set `monitor_slb_network_type` to `intranet` for security if you already have a VPN connecting to your VPC or plan to set one up.
+
+## Upgrade TiDB cluster
+
+To upgrade the TiDB cluster, set the `tidb_version` variable in `variables.tf` to a higher version and run `terraform apply`.
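+
+As a minimal sketch (the version shown is only an example; use whichever released TiDB version you are targeting), the same upgrade can also be driven from the command line without editing `variables.tf`:
+
+```shell
+$ cd tidb-operator/deploy/alicloud
+$ terraform apply -var 'tidb_version=v2.1.8'   # example target version
+```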
+
+## Scale TiDB cluster
+
+To scale the TiDB cluster, modify `tikv_count` or `tidb_count` to your desired count, and then run `terraform apply`.
+
+## Destroy
+
+```shell
+$ terraform destroy
+```
+
+> Note: You have to manually delete the cloud disk used by the monitoring node after destroying the cluster if you don't need it anymore.
+
+## Customize
+
+By default, the Terraform script creates a new VPC. To use an existing VPC, set `vpc_id` in `variables.tf`. Note that when an existing VPC is used, Kubernetes nodes are only created in availability zones that already have a vswitch.
+
+Because the TiDB service is only exposed to the intranet, an ECS instance is also created by default as a bastion machine for connecting to the TiDB cluster. The bastion instance has mysql-cli and sysbench installed to help you use and test TiDB.
+
+If you don't need to access TiDB from the internet, you can disable the creation of the bastion instance by setting `create_bastion` to false in `variables.tf`.
+
+The worker node instance types are also configurable, in two ways:
+
+1. by specifying an instance type id
+2. by specifying capacity, such as instance CPU core count and memory size
+
+Because Alibaba Cloud offers different instance types in different regions, it is recommended to specify the capacity instead of a certain type. You can configure these in `variables.tf`; note that an explicit instance type overrides the capacity configuration.
+
+There is an exception for PD and TiKV instances: because PD and TiKV require local SSD storage, you cannot specify an instance type for them directly. Instead, choose the type family among `ecs.i1`, `ecs.i2` and `ecs.i2g`, which have one or more local NVMe SSDs, and select a certain type within the family by specifying the corresponding memory size variable (e.g. `pd_instance_memory_size`).
+
+For more customization options, please refer to `variables.tf`.
+
+## Limitations
+
+You cannot change the pod cidr, service cidr or worker instance types once the cluster is created.
+
diff --git a/deploy/alicloud/ack/data.tf b/deploy/alicloud/ack/data.tf
new file mode 100644
index 0000000000..5d9315d773
--- /dev/null
+++ b/deploy/alicloud/ack/data.tf
@@ -0,0 +1,40 @@
+data "alicloud_zones" "all" {
+  network_type = "Vpc"
+}
+
+data "alicloud_vswitches" "default" {
+  vpc_id = "${var.vpc_id}"
+}
+
+data "alicloud_instance_types" "default" {
+  availability_zone = "${lookup(data.alicloud_zones.all.zones[0], "id")}"
+  cpu_core_count    = "${var.default_worker_cpu_core_count}"
+}
+
+# Workaround for map to list transformation, see stackoverflow.com/questions/43893295
+data "template_file" "vswitch_id" {
+  count = "${var.vpc_id == "" ?
0 : length(data.alicloud_vswitches.default.vswitches)}" + template = "${lookup(data.alicloud_vswitches.default.0.vswitches[count.index], "id")}" +} + +# Get cluster bootstrap token +data "external" "token" { + depends_on = ["alicloud_cs_managed_kubernetes.k8s"] + + # Terraform use map[string]string to unmarshal the result, transform the json to conform + program = ["bash", "-c", "aliyun --region ${var.region} cs POST /clusters/${alicloud_cs_managed_kubernetes.k8s.id}/token --body '{\"is_permanently\": true}' | jq \"{token: .token}\""] +} + +data "template_file" "userdata" { + template = "${file("${path.module}/templates/user_data.sh.tpl")}" + count = "${length(var.worker_groups)}" + + vars { + pre_userdata = "${lookup(var.worker_groups[count.index], "pre_userdata", var.group_default["pre_userdata"])}" + post_userdata = "${lookup(var.worker_groups[count.index], "post_userdata", var.group_default["post_userdata"])}" + open_api_token = "${lookup(data.external.token.result, "token")}" + node_taints = "${lookup(var.worker_groups[count.index], "node_taints", var.group_default["node_taints"])}" + node_labels = "${lookup(var.worker_groups[count.index], "node_labels", var.group_default["node_labels"])}" + region = "${var.region}" + } +} diff --git a/deploy/alicloud/ack/main.tf b/deploy/alicloud/ack/main.tf new file mode 100644 index 0000000000..4d77f696fc --- /dev/null +++ b/deploy/alicloud/ack/main.tf @@ -0,0 +1,146 @@ +/* + Alicloud ACK module that launches: + + - A managed kubernetes cluster; + - Several auto-scaling groups which acting as worker nodes. + + Each auto-scaling group has the same instance type and will + balance ECS instances across multiple AZ in favor of HA. + */ +provider "alicloud" {} + +resource "alicloud_key_pair" "default" { + count = "${var.key_pair_name == "" ? 1 : 0}" + key_name_prefix = "${var.cluster_name}-key" + key_file = "${var.key_file != "" ? var.key_file : format("%s/%s-key", path.module, var.cluster_name)}" +} + +# If there is not specifying vpc_id, create a new one +resource "alicloud_vpc" "vpc" { + count = "${var.vpc_id == "" ? 1 : 0}" + cidr_block = "${var.vpc_cidr}" + name = "${var.cluster_name}-vpc" + + lifecycle { + ignore_changes = ["cidr_block"] + } +} + +# For new vpc or existing vpc with no vswitches, create vswitch for each zone +resource "alicloud_vswitch" "all" { + count = "${var.vpc_id != "" && (length(data.alicloud_vswitches.default.vswitches) != 0) ? 0 : length(data.alicloud_zones.all.zones)}" + vpc_id = "${alicloud_vpc.vpc.0.id}" + cidr_block = "${cidrsubnet(alicloud_vpc.vpc.0.cidr_block, var.vpc_cidr_newbits, count.index)}" + availability_zone = "${lookup(data.alicloud_zones.all.zones[count.index%length(data.alicloud_zones.all.zones)], "id")}" + name = "${format("vsw-%s-%d", var.cluster_name, count.index+1)}" +} + +resource "alicloud_security_group" "group" { + count = "${var.group_id == "" ? 1 : 0}" + name = "${var.cluster_name}-sg" + vpc_id = "${var.vpc_id != "" ? var.vpc_id : alicloud_vpc.vpc.0.id}" + description = "Security group for ACK worker nodes" +} + +# Allow traffic inside VPC +resource "alicloud_security_group_rule" "cluster_worker_ingress" { + count = "${var.group_id == "" ? 1 : 0}" + security_group_id = "${alicloud_security_group.group.id}" + type = "ingress" + ip_protocol = "all" + nic_type = "intranet" + port_range = "-1/-1" + cidr_ip = "${var.vpc_id != "" ? 
var.vpc_cidr : alicloud_vpc.vpc.0.cidr_block}" +} + +# Create a managed Kubernetes cluster +resource "alicloud_cs_managed_kubernetes" "k8s" { + name = "${var.cluster_name}" + // split and join: workaround for terraform's limitation of conditional list choice, similarly hereinafter + vswitch_ids = ["${element(split(",", var.vpc_id != "" && (length(data.alicloud_vswitches.default.vswitches) != 0) ? join(",", data.template_file.vswitch_id.*.rendered) : join(",", alicloud_vswitch.all.*.id)), 0)}"] + key_name = "${alicloud_key_pair.default.key_name}" + pod_cidr = "${var.k8s_pod_cidr}" + service_cidr = "${var.k8s_service_cidr}" + new_nat_gateway = "${var.create_nat_gateway}" + cluster_network_type = "${var.cluster_network_type}" + slb_internet_enabled = "${var.public_apiserver}" + kube_config = "${var.kubeconfig_file != "" ? var.kubeconfig_file : format("%s/kubeconfig", path.module)}" + worker_numbers = ["${var.default_worker_count}"] + worker_instance_types = ["${var.default_worker_type != "" ? var.default_worker_type : data.alicloud_instance_types.default.instance_types.0.id}"] + + # These varialbes are 'ForceNew' that will cause kubernetes cluster re-creation + # on variable change, so we make all these variables immutable in favor of safety. + lifecycle { + ignore_changes = [ + "vswitch_ids", + "worker_instance_types", + "key_name", + "pod_cidr", + "service_cidr", + "cluster_network_type", + ] + } + + depends_on = ["alicloud_vpc.vpc"] +} + +# Create auto-scaling groups +resource "alicloud_ess_scaling_group" "workers" { + count = "${length(var.worker_groups)}" + scaling_group_name = "${alicloud_cs_managed_kubernetes.k8s.name}-${lookup(var.worker_groups[count.index], "name", count.index)}" + vswitch_ids = ["${split(",", var.vpc_id != "" ? join(",", data.template_file.vswitch_id.*.rendered) : join(",", alicloud_vswitch.all.*.id))}"] + min_size = "${lookup(var.worker_groups[count.index], "min_size", var.group_default["min_size"])}" + max_size = "${lookup(var.worker_groups[count.index], "max_size", var.group_default["max_size"])}" + default_cooldown = "${lookup(var.worker_groups[count.index], "default_cooldown", var.group_default["default_cooldown"])}" + multi_az_policy = "${lookup(var.worker_groups[count.index], "multi_az_policy", var.group_default["multi_az_policy"])}" + + # Remove the newest instance in the oldest scaling configuration + removal_policies = [ + "OldestScalingConfiguration", + "NewestInstance" + ] + + lifecycle { + # FIXME: currently update vswitch_ids will force will recreate, allow updating when upstream support in-place + # vswitch id update + ignore_changes = ["vswitch_ids"] + + create_before_destroy = true + } +} + +# Create the cooresponding auto-scaling configurations +resource "alicloud_ess_scaling_configuration" "workers" { + count = "${length(var.worker_groups)}" + scaling_group_id = "${element(alicloud_ess_scaling_group.workers.*.id, count.index)}" + image_id = "${lookup(var.worker_groups[count.index], "image_id", var.group_default["image_id"])}" + instance_type = "${lookup(var.worker_groups[count.index], "instance_type", var.group_default["instance_type"])}" + security_group_id = "${var.group_id != "" ? 
var.group_id : alicloud_security_group.group.id}" + key_name = "${alicloud_key_pair.default.key_name}" + system_disk_category = "${lookup(var.worker_groups[count.index], "system_disk_category", var.group_default["system_disk_category"])}" + system_disk_size = "${lookup(var.worker_groups[count.index], "system_disk_size", var.group_default["system_disk_size"])}" + user_data = "${element(data.template_file.userdata.*.rendered, count.index)}" + internet_charge_type = "${lookup(var.worker_groups[count.index], "internet_charge_type", var.group_default["internet_charge_type"])}" + internet_max_bandwidth_in = "${lookup(var.worker_groups[count.index], "internet_max_bandwidth_in", var.group_default["internet_max_bandwidth_in"])}" + internet_max_bandwidth_out = "${lookup(var.worker_groups[count.index], "internet_max_bandwidth_out", var.group_default["internet_max_bandwidth_out"])}" + + enable = true + active = true + force_delete = true + + tags = "${merge(map( + "name", "${alicloud_cs_managed_kubernetes.k8s.name}-${lookup(var.worker_groups[count.index], "name", count.index)}-ack_asg", + "kubernetes.io/cluster/${alicloud_cs_managed_kubernetes.k8s.name}", "owned", + "k8s.io/cluster-autoscaler/${lookup(var.worker_groups[count.index], "autoscaling_enabled", var.group_default["autoscaling_enabled"]) == 1 ? "enabled" : "disabled"}", "true", + "k8s.io/cluster-autoscaler/${alicloud_cs_managed_kubernetes.k8s.name}", "default" + ), + var.default_group_tags, + var.worker_group_tags[count.index%length(var.worker_group_tags)] + ) + }" + + lifecycle { + ignore_changes = ["instance_type"] + create_before_destroy = true + } +} diff --git a/deploy/alicloud/ack/outputs.tf b/deploy/alicloud/ack/outputs.tf new file mode 100644 index 0000000000..dc1f5fabdb --- /dev/null +++ b/deploy/alicloud/ack/outputs.tf @@ -0,0 +1,34 @@ +output "cluster_id" { + description = "The id of the ACK cluster." + value = "${alicloud_cs_managed_kubernetes.k8s.id}" +} + +output "cluster_name" { + description = "The name of ACK cluster" + value = "${alicloud_cs_managed_kubernetes.k8s.name}" +} + +output "cluster_nodes" { + description = "The cluster worker node ids of ACK cluster" + value = "${alicloud_ess_scaling_configuration.workers.*.id}" +} + +output "vpc_id" { + description = "The vpc id of ACK cluster" + value = "${alicloud_cs_managed_kubernetes.k8s.vpc_id}" +} + +output "vswitch_ids" { + description = "The vswich ids of ACK cluster" + value = "${alicloud_cs_managed_kubernetes.k8s.vswitch_ids}" +} + +output "security_group_id" { + description = "The security_group_id of ACK cluster" + value = "${alicloud_cs_managed_kubernetes.k8s.security_group_id}" +} + +output "kubeconfig_filename" { + description = "The filename of the generated kubectl config." 
+ value = "${path.module}/kubeconfig_${var.cluster_name}" +} diff --git a/deploy/alicloud/ack/templates/user_data.sh.tpl b/deploy/alicloud/ack/templates/user_data.sh.tpl new file mode 100644 index 0000000000..572b6afe6f --- /dev/null +++ b/deploy/alicloud/ack/templates/user_data.sh.tpl @@ -0,0 +1,16 @@ +#!/bin/bash -xe + +# Pre userdata code +${pre_userdata} + +# Bootstrap node and join the k8s cluster +curl -o attach_node.sh http://aliacs-k8s-${region}.oss-${region}-internal.aliyuncs.com/public/pkg/run/attach/1.12.6-aliyun.1/attach_node.sh + +# Hack: remove AUTO_FDISK statement to avoid local ssd being formatted +# TODO: add attach_node.sh to project +sed -i '/export AUTO_FDISK=/d' ./attach_node.sh +chmod a+x ./attach_node.sh +./attach_node.sh --ess "true" --openapi-token "${open_api_token}" %{ if node_labels != "" }--labels ${node_labels}%{ endif } %{ if node_taints != "" }--taints ${node_taints}%{ endif } + +# Post userdata code +${post_userdata} diff --git a/deploy/alicloud/ack/variables.tf b/deploy/alicloud/ack/variables.tf new file mode 100644 index 0000000000..db7e4e4a9e --- /dev/null +++ b/deploy/alicloud/ack/variables.tf @@ -0,0 +1,160 @@ +variable "region" { + description = "Alicloud region" +} + +variable "cluster_name" { + description = "Kubernetes cluster name" + default = "ack-cluster" +} + +variable "cluster_network_type" { + description = "Kubernetes network plugin, options: [flannel, terway]. Cannot change once created." + default = "flannel" +} + +variable "span_all_zones" { + description = "Whether span worker nodes in all avaiable zones, worker_zones will be ignored if span_all_zones=true" + default = true +} + +variable "worker_zones" { + description = "Available zones of worker nodes, used when span_all_zones=false. It is highly recommended to guarantee the instance type of workers is available in at least two zones in favor of HA." + type = "list" + default = [] +} + +variable "public_apiserver" { + description = "Whether enable apiserver internet access" + default = false +} + +variable "kubeconfig_file" { + description = "The path that kubeconfig file write to, default to $${path.module}/kubeconfig if empty." + default = "" +} + +variable "k8s_pod_cidr" { + description = "The kubernetes pod cidr block. It cannot be equals to vpc's or vswitch's and cannot be in them. Cannot change once the cluster created." + default = "172.20.0.0/16" +} + +variable "k8s_service_cidr" { + description = "The kubernetes service cidr block. It cannot be equals to vpc's or vswitch's or pod's and cannot be in them. Cannot change once the cluster created." + default = "172.21.0.0/20" +} + +variable "vpc_cidr" { + description = "VPC cidr_block, options: [192.168.0.0.0/16, 172.16.0.0/16, 10.0.0.0/8], cannot collidate with kubernetes service cidr and pod cidr. Cannot change once the vpc created." + default = "192.168.0.0/16" +} + +variable "key_file" { + description = "The path that new key file write to, defaul to $${path.module}/$${cluster_name}-key.pem if empty" + default = "" +} + +variable "key_pair_name" { + description = "Key pair for worker instance, specify this variable to use an exsitng key pair. A new key pair will be created by default." + default = "" +} + +variable "vpc_id" { + description = "VPC id, specify this variable to use an exsiting VPC and the vswitches in the VPC. Note that when using existing vpc, it is recommended to use a existing security group too. Otherwise you have to set vpc_cidr according to the existing VPC settings to get correct in-cluster security rule." 
+ default = "" +} + +variable "group_id" { + description = "Security group id, specify this variable to use and exising security group" + default = "" +} + +variable "vpc_cidr_newbits" { + description = "VPC cidr newbits, it's better to be set as 16 if you use 10.0.0.0/8 cidr block" + default = "8" +} + +variable "create_nat_gateway" { + description = "If create nat gateway in VPC" + default = true +} + +variable "default_worker_count" { + description = "The number of kubernetes default worker nodes, value: [2,50]. See module README for detail." + default = 2 +} + +variable "default_worker_cpu_core_count" { + description = "The instance cpu core count of kubernetes default worker nodes, this variable will be ignroed if default_worker_type set" + default = 1 +} + +variable "default_worker_type" { + description = "The instance type of kubernets default worker nodes, it is recommend to use default_worker_cpu_core_count to select flexible instance type" + default = "" +} + +variable "worker_groups" { + description = "A list of maps defining worker group configurations to be defined using alicloud ESS. See group_default for validate keys." + type = "list" + + default = [ + { + "name" = "default" + }, + ] +} + +variable "group_default" { + description = < 8, default thread pool size for coprocessors + # will be set to tikv.resources.limits.cpu * 0.8. + # readpoolCoprocessorConcurrency: 8 + + # scheduler's worker pool size, should increase it in heavy write cases, + # also should less than total cpu cores. + # storageSchedulerWorkerPoolSize: 4 + +tikvPromGateway: + image: prom/pushgateway:v0.3.1 + imagePullPolicy: IfNotPresent + resources: + limits: {} + # cpu: 100m + # memory: 100Mi + requests: {} + # cpu: 50m + # memory: 50Mi + +tidb: + replicas: ${tidb_replicas} + # The secret name of root password, you can create secret with following command: + # kubectl create secret generic tidb-secret --from-literal=root_password= + # If unset, the root password will be empty and you can set it after connecting + # passwordSecretName: tidb-secret + # initSql is the SQL statements executed after the TiDB cluster is bootstrapped. + # initSql: |- + # create database app; + image: "pingcap/tidb:${cluster_version}" + # Image pull policy. + imagePullPolicy: IfNotPresent + logLevel: info + resources: + limits: {} + # cpu: 16000m + # memory: 16Gi + requests: {} + # cpu: 12000m + # memory: 12Gi + nodeSelector: + dedicated: tidb + # kind: tidb + # zone: cn-bj1-01,cn-bj1-02 + # region: cn-bj1 + tolerations: + - key: dedicated + operator: Equal + value: tidb + effect: "NoSchedule" + maxFailoverCount: 3 + service: + type: LoadBalancer + exposeStatus: true + annotations: + service.beta.kubernetes.io/alicloud-loadbalancer-address-type: intranet + service.beta.kubernetes.io/alicloud-loadbalancer-slb-network-type: vpc + # separateSlowLog: true + slowLogTailer: + image: busybox:1.26.2 + resources: + limits: + cpu: 100m + memory: 50Mi + requests: + cpu: 20m + memory: 5Mi + +# mysqlClient is used to set password for TiDB +# it must has Python MySQL client installed +mysqlClient: + image: tnir/mysqlclient + imagePullPolicy: IfNotPresent + +monitor: + create: true + # Also see rbac.create + # If you set rbac.create to false, you need to provide a value here. + # If you set rbac.create to true, you should leave this empty. 
+ # serviceAccount: + persistent: true + storageClassName: ${monitor_storage_class} + storage: ${monitor_storage_size} + grafana: + create: true + image: grafana/grafana:6.0.1 + imagePullPolicy: IfNotPresent + logLevel: info + resources: + limits: {} + # cpu: 8000m + # memory: 8Gi + requests: {} + # cpu: 4000m + # memory: 4Gi + username: admin + password: admin + service: + type: LoadBalancer + # if grafana is running behind a reverse proxy with subpath http://foo.bar/grafana + # config the `serverDomain` and `serverRootUrl` as follows + # serverDomain: foo.bar + # serverRootUrl: "%(protocol)s://%(domain)s/grafana/" + prometheus: + image: prom/prometheus:v2.2.1 + imagePullPolicy: IfNotPresent + logLevel: info + resources: + limits: {} + # cpu: 8000m + # memory: 8Gi + requests: {} + # cpu: 4000m + # memory: 4Gi + service: + type: NodePort + annotations: + service.beta.kubernetes.io/alicloud-loadbalancer-address-type: ${monitor_slb_network_type} + reserveDays: ${monitor_reserve_days} + # alertmanagerURL: "" + nodeSelector: {} + # kind: monitor + # zone: cn-bj1-01,cn-bj1-02 + # region: cn-bj1 + tolerations: [] + # - key: node-role + # operator: Equal + # value: tidb + # effect: "NoSchedule" + +binlog: + pump: + create: false + replicas: 1 + image: "pingcap/tidb-binlog:${cluster_version}" + imagePullPolicy: IfNotPresent + logLevel: info + # storageClassName is a StorageClass provides a way for administrators to describe the "classes" of storage they offer. + # different classes might map to quality-of-service levels, or to backup policies, + # or to arbitrary policies determined by the cluster administrators. + # refer to https://kubernetes.io/docs/concepts/storage/storage-classes + storageClassName: ${local_storage_class} + storage: 10Gi + # a integer value to control expiry date of the binlog data, indicates for how long (in days) the binlog data would be stored. + # must bigger than 0 + gc: 7 + # number of seconds between heartbeat ticks (in 2 seconds) + heartbeatInterval: 2 + + drainer: + create: false + image: "pingcap/tidb-binlog:${cluster_version}" + imagePullPolicy: IfNotPresent + logLevel: info + # storageClassName is a StorageClass provides a way for administrators to describe the "classes" of storage they offer. + # different classes might map to quality-of-service levels, or to backup policies, + # or to arbitrary policies determined by the cluster administrators. 
+ # refer to https://kubernetes.io/docs/concepts/storage/storage-classes + storageClassName: ${local_storage_class} + storage: 10Gi + # parallel worker count (default 1) + workerCount: 1 + # the interval time (in seconds) of detect pumps' status (default 10) + detectInterval: 10 + # disbale detect causality + disableDetect: false + # disable dispatching sqls that in one same binlog; if set true, work-count and txn-batch would be useless + disableDispatch: false + # # disable sync these schema + ignoreSchemas: "INFORMATION_SCHEMA,PERFORMANCE_SCHEMA,mysql,test" + # if drainer donesn't have checkpoint, use initial commitTS to initial checkpoint + initialCommitTs: 0 + # enable safe mode to make syncer reentrant + safeMode: false + # number of binlog events in a transaction batch (default 1) + txnBatch: 1 + # downstream storage, equal to --dest-db-type + # valid values are "mysql", "pb", "kafka" + destDBType: pb + mysql: {} + # host: "127.0.0.1" + # user: "root" + # password: "" + # port: 3306 + # # Time and size limits for flash batch write + # timeLimit: "30s" + # sizeLimit: "100000" + kafka: {} + # only need config one of zookeeper-addrs and kafka-addrs, will get kafka address if zookeeper-addrs is configed. + # zookeeperAddrs: "127.0.0.1:2181" + # kafkaAddrs: "127.0.0.1:9092" + # kafkaVersion: "0.8.2.0" + +scheduledBackup: + create: false + binlogImage: "pingcap/tidb-binlog:${cluster_version}" + binlogImagePullPolicy: IfNotPresent + # https://github.com/tennix/tidb-cloud-backup + mydumperImage: pingcap/tidb-cloud-backup:latest + mydumperImagePullPolicy: IfNotPresent + # storageClassName is a StorageClass provides a way for administrators to describe the "classes" of storage they offer. + # different classes might map to quality-of-service levels, or to backup policies, + # or to arbitrary policies determined by the cluster administrators. + # refer to https://kubernetes.io/docs/concepts/storage/storage-classes + storageClassName: ${local_storage_class} + storage: 100Gi + # https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/#schedule + schedule: "0 0 * * *" + # https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/#suspend + suspend: false + # https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/#jobs-history-limits + successfulJobsHistoryLimit: 3 + failedJobsHistoryLimit: 1 + # https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/#starting-deadline + startingDeadlineSeconds: 3600 + # https://github.com/maxbube/mydumper/blob/master/docs/mydumper_usage.rst#options + options: "--chunk-filesize=100" + # secretName is the name of the secret which stores user and password used for backup + # Note: you must give the user enough privilege to do the backup + # you can create the secret by: + # kubectl create secret generic backup-secret --from-literal=user=root --from-literal=password= + secretName: backup-secret + # backup to gcp + gcp: {} + # bucket: "" + # secretName is the name of the secret which stores the gcp service account credentials json file + # The service account must have read/write permission to the above bucket. 
+ # Read the following document to create the service account and download the credentials file as credentials.json: + # https://cloud.google.com/docs/authentication/production#obtaining_and_providing_service_account_credentials_manually + # And then create the secret by: kubectl create secret generic gcp-backup-secret --from-file=./credentials.json + # secretName: gcp-backup-secret + + # backup to ceph object storage + ceph: {} + # endpoint: "" + # bucket: "" + # secretName is the name of the secret which stores ceph object store access key and secret key + # You can create the secret by: + # kubectl create secret generic ceph-backup-secret --from-literal=access_key= --from-literal=secret_key= + # secretName: ceph-backup-secret + +metaInstance: "{{ $labels.instance }}" +metaType: "{{ $labels.type }}" +metaValue: "{{ $value }}" \ No newline at end of file diff --git a/deploy/alicloud/userdata/bastion-userdata b/deploy/alicloud/userdata/bastion-userdata new file mode 100644 index 0000000000..bb3549482f --- /dev/null +++ b/deploy/alicloud/userdata/bastion-userdata @@ -0,0 +1,6 @@ +#cloud-config +packages: +- mysql +runcmd: +- curl -s https://packagecloud.io/install/repositories/akopytov/sysbench/script.rpm.sh | bash +- yum -y install sysbench \ No newline at end of file diff --git a/deploy/alicloud/userdata/pd-userdata.sh b/deploy/alicloud/userdata/pd-userdata.sh new file mode 100644 index 0000000000..0740182f9e --- /dev/null +++ b/deploy/alicloud/userdata/pd-userdata.sh @@ -0,0 +1,15 @@ +#!/bin/sh +# set system ulimits +cat < /etc/security/limits.d/99-tidb.conf +root soft nofile 1000000 +root hard nofile 1000000 +root soft core unlimited +root soft stack 10240 +EOF +# config docker ulimits +cp /usr/lib/systemd/system/docker.service /etc/systemd/system/docker.service +sed -i 's/LimitNOFILE=infinity/LimitNOFILE=1048576/' /etc/systemd/system/docker.service +sed -i 's/LimitNPROC=infinity/LimitNPROC=1048576/' /etc/systemd/system/docker.service +systemctl daemon-reload +systemctl restart docker + diff --git a/deploy/alicloud/userdata/tikv-userdata.sh b/deploy/alicloud/userdata/tikv-userdata.sh new file mode 100644 index 0000000000..af67992d36 --- /dev/null +++ b/deploy/alicloud/userdata/tikv-userdata.sh @@ -0,0 +1,14 @@ +#!/bin/sh +# set system ulimits +cat < /etc/security/limits.d/99-tidb.conf +root soft nofile 1000000 +root hard nofile 1000000 +root soft core unlimited +root soft stack 10240 +EOF +# config docker ulimits +cp /usr/lib/systemd/system/docker.service /etc/systemd/system/docker.service +sed -i 's/LimitNOFILE=infinity/LimitNOFILE=1048576/' /etc/systemd/system/docker.service +sed -i 's/LimitNPROC=infinity/LimitNPROC=1048576/' /etc/systemd/system/docker.service +systemctl daemon-reload +systemctl restart docker diff --git a/deploy/alicloud/variables.tf b/deploy/alicloud/variables.tf new file mode 100644 index 0000000000..e597a917a6 --- /dev/null +++ b/deploy/alicloud/variables.tf @@ -0,0 +1,146 @@ +variable "cluster_name" { + description = "TiDB cluster name" + default = "tidb-cluster" +} + +variable "tidb_version" { + description = "TiDB cluster version" + default = "v2.1.0" +} + +variable "pd_count" { + description = "PD instance count, the recommend value is 3" + default = 3 +} + +variable "pd_instance_type_family" { + description = "PD instance type family, values: [ecs.i2, ecs.i1, ecs.i2g]" + default = "ecs.i2" +} + +variable "pd_instance_memory_size" { + description = "PD instance memory size in GB, must available in the type famliy" + default = 32 +} + +variable "tikv_count" { + 
description = "TiKV instance count, ranges: [3, 100]" + default = 4 +} + +variable "tikv_instance_type_family" { + description = "TiKV instance memory in GB, must available in type family" + default = "ecs.i2" +} + +variable "tikv_memory_size" { + description = "TiKV instance memory in GB, must available in type family" + default = 64 +} + +variable "tidb_count" { + description = "TiDB instance count, ranges: [1, 100]" + default = 3 +} + +variable "tidb_instance_type" { + description = "TiDB instance type, this variable override tidb_instance_core_count and tidb_instance_memory_size, is recommended to use the tidb_instance_core_count and tidb_instance_memory_size to select instance type in favor of flexibility" + + default = "" +} + +variable "tidb_instance_core_count" { + default = 16 +} + +variable "tidb_instance_memory_size" { + default = 32 +} + +variable "monitor_intance_type" { + description = "Monitor instance type, this variable override tidb_instance_core_count and tidb_instance_memory_size, is recommended to use the tidb_instance_core_count and tidb_instance_memory_size to select instance type in favor of flexibility" + + default = "" +} + +variable "monitor_instance_core_count" { + default = 4 +} + +variable "monitor_instance_memory_size" { + default = 8 +} + +variable "monitor_storage_class" { + description = "Monitor PV storageClass, values: [alicloud-disk-commo, alicloud-disk-efficiency, alicloud-disk-ssd, alicloud-disk-available]" + default = "alicloud-disk-available" +} + +variable "monitor_storage_size" { + description = "Monitor storage size in Gi" + default = 500 +} + +variable "monitor_reserve_days" { + description = "Monitor data reserve days" + default = 14 +} + +variable "create_bastion" { + description = "Whether create bastion server" + default = true +} + +variable "bastion_image_name" { + description = "OS image of bastion" + default = "centos_7_06_64_20G_alibase_20190218.vhd" +} + +variable "bastion_key_prefix" { + default = "bastion-key" +} + +variable "bastion_cpu_core_count" { + description = "CPU core count to select bastion type" + default = 1 +} + +variable "bastion_ingress_cidr" { + description = "Bastion ingress security rule cidr, it is highly recommended to set this in favor of safety" + default = "0.0.0.0/0" +} + +variable "monitor_slb_network_type" { + description = "The monitor slb network type, values: [internet, intranet]. It is recommended to set it as intranet and access via VPN in favor of safety" + default = "internet" +} + +variable "vpc_id" { + description = "VPC id, specify this variable to use an exsiting VPC and the vswitches in the VPC. Note that when using existing vpc, it is recommended to use a existing security group too. Otherwise you have to set vpc_cidr according to the existing VPC settings to get correct in-cluster security rule." + default = "" +} + +variable "group_id" { + description = "Security group id, specify this variable to use and exising security group" + default = "" +} + +variable "vpc_cidr_newbits" { + description = "VPC cidr newbits, it's better to be set as 16 if you use 10.0.0.0/8 cidr block" + default = "8" +} + +variable "k8s_pod_cidr" { + description = "The kubernetes pod cidr block. It cannot be equals to vpc's or vswitch's and cannot be in them. Cannot change once the cluster created." + default = "172.20.0.0/16" +} + +variable "k8s_service_cidr" { + description = "The kubernetes service cidr block. It cannot be equals to vpc's or vswitch's or pod's and cannot be in them. 
Cannot change once the cluster created." + default = "172.21.0.0/20" +} + +variable "vpc_cidr" { + description = "VPC cidr_block, options: [192.168.0.0.0/16, 172.16.0.0/16, 10.0.0.0/8], cannot collidate with kubernetes service cidr and pod cidr. Cannot change once the vpc created." + default = "192.168.0.0/16" +}
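+
+# A non-authoritative example: instead of editing this file, any of the variables above
+# can be overridden in a terraform.tfvars file (auto-loaded by terraform) or via -var flags.
+# The values below are illustrative only and are not applied by default:
+#
+#   tidb_version             = "v2.1.8"
+#   tikv_count               = 5
+#   tidb_count               = 3
+#   monitor_slb_network_type = "intranet"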