GKE Installer

Fury Kubernetes Installer - Managed Services - GKE - oss project.

GKE vs Compute Instances (managed vs self-managed)

Before continuing, you should understand the benefits and drawbacks of creating a GKE cluster instead of running your own Kubernetes control plane on Google Cloud Compute instances.

Price

GKE currently costs $0.10 per hour for a HA control plane.

An n1-standard-2 compute instance currently costs $0.095 per hour. An HA control plane with 3 x n1-standard-2 instances therefore costs: $0.095 x 3 instances = $0.285 per hour.

GKE is cheaper in most scenarios.

You can reduce the control-plane cost by covering these instances with committed use discounts, but that requires committing to (and paying for) the instances for an extended period.

This cost analysis was done in May 2020; all prices were taken from the official Google Cloud Platform pricing lists.

Management

GKE is a fully managed service provided by GCP, meaning that you don't need to worry about backups, recovery, availability, scalability, certificates… even authentication to the cluster is managed by Google.

You'll have to set up all of these features yourself if you choose to host your own control plane. On the other hand, a self-managed setup lets you customize aspects that GKE does not expose: audit logs, Kubernetes API server feature flags, your own authentication provider, and other platform services.

So, if you need to set up a non-default cluster, you should consider going with a self-managed cluster. Otherwise, GKE is a good option.

Day two operations

As mentioned before, GKE is responsible for keeping the Kubernetes control plane fully operational, with a monthly uptime percentage of at least 99.95%.

source: https://cloud.google.com/kubernetes-engine/sla

On the other hand, in a self-managed setup you have to take care of backups, disaster recovery strategies, the HA setup, certificate rotation, and control-plane and worker updates.

Requirements

As mentioned in the common requirements, the operator responsible for creating a GKE cluster has to have connectivity from the operator's machine (bastion host, laptop with a configured VPN…) to the network where the cluster will be placed.

The machine used to create the cluster should have the following installed (you can verify the tooling with the commands shown after this list):

  • OS tooling: git, ssh, curl and unzip.
  • terraform version > 0.12.
  • the latest gcloud CLI version.
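
You can quickly check that the required tooling is available on your machine, for example:

$ git --version
$ terraform version
$ gcloud version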

Cloud requirements

This installer has three main cloud requirements:

  • A dedicated VPC.
  • Enough permissions to create all the resources surrounding the GKE cluster.
  • If your workloads need internet connectivity, you should understand how connectivity works in a GKE private cluster: Using Cloud NAT with GKE Cluster (see the example after this list).
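
As a reference, a Cloud NAT can be attached to the cluster's VPC with gcloud. This is only a sketch: the router and NAT names and the region below are placeholders, adapt them to your environment.

$ gcloud compute routers create gke-router --network gke-vpc --region europe-west1
$ gcloud compute routers nats create gke-nat --router gke-router --region europe-west1 --auto-allocate-nat-external-ips --nat-all-subnet-ip-ranges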

Gather all input values

Before starting to use this installer, you should know the values of the following input variables:

  • cluster_name: Unique cluster name.
  • cluster_version: GKE version to use. Example: 1.15.12-gke.6 or 1.16.10-gke.8. Take a look at the available GKE Kubernetes versions (see the gcloud commands after this list).
  • network: Name of the network where the cluster will be created.
  • subnetworks: List of three subnetwork names:
    • index 0: The subnetwork that will host the cluster.
    • index 1: The name of the secondary subnet IP range to use for pods.
    • index 2: The name of the secondary subnet IP range to use for services. All subnetworks must belong to network.
  • ssh_public_key: Cluster administrator public SSH key. Used to access cluster nodes with the operator_ssh_user.
  • dmz_cidr_range: Network CIDR range from which the cluster's control plane will be accessible.
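
You can discover most of these values with gcloud. For example (the region and subnetwork name below are placeholders), the first command lists the GKE versions available in a region and the second shows a subnetwork's secondary IP ranges:

$ gcloud container get-server-config --region europe-west1
$ gcloud compute networks subnets describe gke-subnet --region europe-west1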

Getting started

Make sure to set up all the prerequisites before continuing, including cloud credentials, VPN/bastion/network configuration, and gathering all the required input values.
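
For example, cloud credentials can be provided via application default credentials (the project ID below is a placeholder; a service account key is a valid alternative):

$ gcloud auth application-default login
$ gcloud config set project my-gcp-project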

Create a new directory to save all terraform files:

$ mkdir /home/operator/sighup/my-cluster-at-gke
$ cd /home/operator/sighup/my-cluster-at-gke

Create the following files:

main.tf

variable "cluster_name" {}
variable "cluster_version" {}
variable "network" {}
variable "subnetworks" { type = list }
variable "dmz_cidr_range" {}
variable "ssh_public_key" {}
variable "node_pools" { type = list }

module "my-cluster" {
  source = "github.com/sighupio/fury-gke-installer//modules/gke?ref=v1.0.0"

  cluster_version = var.cluster_version
  cluster_name    = var.cluster_name
  network         = var.network
  subnetworks     = var.subnetworks
  ssh_public_key  = var.ssh_public_key
  dmz_cidr_range  = var.dmz_cidr_range
  node_pools      = var.node_pools
}

data "google_client_config" "current" {}

output "kube_config" {
  sensitive = true
  value     = <<EOT
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: ${module.my-cluster.cluster_certificate_authority}
    server: ${module.my-cluster.cluster_endpoint}
  name: gke
contexts:
- context:
    cluster: gke
    user: gke
  name: gke
current-context: gke
kind: Config
preferences: {}
users:
- name: gke
  user:
    token: ${data.google_client_config.current.access_token}
EOT
}

Create my-cluster.tfvars including your environment values:

cluster_name    = "my-cluster"
cluster_version = "1.15.12-gke.6"
network         = "gke-vpc"
subnetworks     = ["gke-subnet", "gke-subnet-pod", "gke-subnet-svc"]
ssh_public_key  = "ssh-rsa example"
dmz_cidr_range  = "10.10.0.0/16"
node_pools = [
  {
    name : "node-pool-1"
    version : null # To use the cluster_version
    min_size : 1
    max_size : 1
    instance_type : "n1-standard-1"
    volume_size : 100
    labels : {
      "sighup.io/role" : "app"
      "sighup.io/fury-release" : "v1.3.0"
    }
    taints : []
  },
  {
    name : "node-pool-2"
    version : "1.14.10-gke.34"
    min_size : 1
    max_size : 1
    instance_type : "n1-standard-2"
    volume_size : 50
    labels : {}
    taints : [
      "sighup.io/role=app:NoSchedule"
    ]
  }
]

With these two files, the installer is ready to create everything needed to set up a GKE cluster with two different node pools (if you don't modify the node_pools example value) using Kubernetes 1.15.

$ ls -lrt
total 16
-rw-r--r--  1 sighup  staff  1171 27 Apr 16:35 my-cluster.tfvars
-rw-r--r--  1 sighup  staff  1128 27 Apr 16:36 main.tf
$ terraform init
Initializing modules...
- my-cluster in ../../modules/google/gke-sighup/modules/gke
Downloading terraform-google-modules/kubernetes-engine/google 8.1.0 for my-cluster.gke...
- my-cluster.gke in .terraform/modules/my-cluster.gke/terraform-google-kubernetes-engine-8.1.0/modules/beta-private-cluster

Initializing the backend...

Initializing provider plugins...
- Checking for available provider plugins...
- Downloading plugin for provider "null" (hashicorp/null) 2.1.2...
- Downloading plugin for provider "google" (hashicorp/google) 3.19.0...
- Downloading plugin for provider "kubernetes" (hashicorp/kubernetes) 1.11.1...
- Downloading plugin for provider "google-beta" (terraform-providers/google-beta) 3.19.0...
- Downloading plugin for provider "random" (hashicorp/random) 2.2.1...

The following providers do not have any version constraints in configuration,
so the latest version was installed.

To prevent automatic upgrades to new major versions that may contain breaking
changes, it is recommended to add version = "..." constraints to the
corresponding provider blocks in configuration, with the constraint strings
suggested below.

* provider.null: version = "~> 2.1"
* provider.random: version = "~> 2.2"

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.
$ terraform plan --var-file my-cluster.tfvars --out my-cluster.plan
<TRUNCATED OUTPUT>
Plan: 11 to add, 0 to change, 0 to destroy.

------------------------------------------------------------------------

This plan was saved to: my-cluster.plan

To perform exactly these actions, run the following command to apply:
    terraform apply "my-cluster.plan"

Review the plan carefully before applying anything. It should create 11 resources.

$ terraform apply my-cluster.plan
<TRUNCATED OUTPUT>
Apply complete! Resources: 11 added, 0 changed, 0 destroyed.

Outputs:

kubeconfig = <sensitive>

To get your kubeconfig file, run the following commands:

Note that kubectl will use a time-limited access token, so access will expire after a while.

$ terraform output kubeconfig > kube.config
$ kubectl cluster-info --kubeconfig kube.config
Kubernetes master is running at https://10.0.0.2
calico-typha is running at https://10.0.0.2/api/v1/namespaces/kube-system/services/calico-typha:calico-typha/proxy
KubeDNS is running at https://10.0.0.2/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
Metrics-server is running at https://10.0.0.2/api/v1/namespaces/kube-system/services/https:metrics-server:/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
$ kubectl get nodes --kubeconfig kube.config
NAME                                       STATUS   ROLES    AGE     VERSION
gke-my-cluster-node-pool-1-996e76a1-2bb0   Ready    <none>   3m58s   v1.15.12-gke.6
gke-my-cluster-node-pool-1-c474d9ba-34k7   Ready    <none>   3m55s   v1.15.12-gke.6
gke-my-cluster-node-pool-1-d066baff-rsrf   Ready    <none>   3m54s   v1.15.12-gke.6
gke-my-cluster-node-pool-2-8509e4af-prv9   Ready    <none>   5m19s   v1.15.12-gke.6
gke-my-cluster-node-pool-2-bc27c727-hw3l   Ready    <none>   5m20s   v1.15.12-gke.6
gke-my-cluster-node-pool-2-e20d430c-6s0n   Ready    <none>   5m24s   v1.15.12-gke.6
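
Since the access token in the generated kubeconfig is time-limited, you will need to regenerate the file from time to time. One way to do it, assuming terraform refresh re-reads the google_client_config data source, is:

$ terraform refresh --var-file my-cluster.tfvars
$ terraform output kubeconfig > kube.config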

GKE number of nodes

GKE deploys the same number of nodes in each zone of the region to provide HA by default, meaning that if you specify just 1 node (min and max) in a node_pool (as in the example), you will end up with 3 nodes in that node_pool (if the region has 3 availability zones).

Update control plane

To update the control plane, just modify cluster_version to the next available version:

$ diff my-cluster.tfvars my-cluster-updated.tfvars
2c2
< cluster_version = "1.15.12-gke.6"
---
> cluster_version = "1.16.10-gke.8"

After modifying cluster_version, execute:

$ terraform plan --var-file my-cluster-updated.tfvars --out my-cluster.plan
<TRUNCATED OUTPUT>
Plan: 0 to add, 2 to change, 0 to destroy.

Please read the output plan carefully. Once you understand the changes, apply it:

$ terraform apply my-cluster.plan
<TRUNCATED OUTPUT>
Apply complete! Resources: 0 added, 2 changed, 0 destroyed.

It can take up to 25-30 minutes.

After updating the control plane you end up with:

  • The GKE control plane updated from Kubernetes 1.15 to 1.16.
  • node-pool-1 updated to 1.16 (it follows cluster_version because its version is null).
  • node-pool-2 still on Kubernetes 1.15.
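
You can verify the new versions with kubectl, for example:

$ kubectl version --short --kubeconfig kube.config
$ kubectl get nodes --kubeconfig kube.config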

Update node pools

To update a node pool, just modify the node pool's version attribute to match the control-plane version:

If you set version to null, you don't need to do anything else: node pools with a null version are updated alongside the control plane.

$ diff my-cluster.tfvars my-cluster-updated.tfvars
26c26
<     version : "1.15.12-gke.6"
---
>     version : "1.16.10-gke.8"

After that, run:

$ terraform plan --var-file my-cluster-updated.tfvars --out my-cluster.plan
<TRUNCATED OUTPUT>
Plan: 0 to add, 1 to change, 0 to destroy.

Review the plan before applying anything:

$ terraform apply my-cluster.plan
<TRUNCATED OUTPUT>
Apply complete! Resources: 0 added, 1 changed, 0 destroyed.

It takes less than 10 minutes.

Lift and Shift node pool update

You can also apply another node pool update strategy, named lift and shift: create a new node pool with the updated version, then move all workloads to the new nodes (see the example below), and finally remove the old node pool or set its number of instances to 0.
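
For example, once the new node pool is ready, you can drain the old nodes one by one so that workloads are rescheduled onto the new pool. A minimal sketch, assuming the old pool is node-pool-2 (GKE labels every node with cloud.google.com/gke-nodepool=<pool name>):

$ kubectl get nodes -l cloud.google.com/gke-nodepool=node-pool-2 --kubeconfig kube.config
$ kubectl drain <old-node-name> --ignore-daemonsets --delete-local-data --kubeconfig kube.config

Repeat the drain command for each node of the old pool, then scale the old pool down or remove it from node_pools.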

Tear down the environment

If you no longer need the cluster, go to the terraform directory where you created it (cd /home/operator/sighup/my-cluster-at-gke) and type:

$ terraform destroy --var-file my-cluster.tfvars
<TRUNCATED OUTPUT>
Plan: 0 to add, 0 to change, 11 to destroy.

Do you really want to destroy all resources?
  Terraform will destroy all your managed infrastructure, as shown above.
  There is no undo. Only 'yes' will be accepted to confirm.

Type yes and press Enter to continue with the destruction. It will take around 15 minutes.