(Update: This post has been updated to reflect changing backup tool from WAL-E to WAL-G. WAL-G is a more modern and faster implementation of cloud backups for postgreSQL)
Several Redpill Linpro customers are now in the kubernetes way of delivery. Kubernetes has changed the way they work, and is acting as an effective catalyst empowering their developers. For these customers, the old-school way of running PostgreSQL is becoming a bit cumbersome:
The typical PostgreSQL installation has been based on bare metal, or, the past few years, virtual machines. They are often set up as streaming replication clusters, with a primary r/w instance, and one or more replicating r/o secondaries. Additionally, a lot of effort is placed on making sure the installation is highly available, and that backups are taken to separate hardware.
Speaking of backups - you should test recovering them regularly. Imagine doing that over many clusters that run even on virtual machines.
If you need to manage many clusters, this has a very noticable effect on the time you spend keeping them in the desired state - up, available, healthy and backed up. And then you run out of disk. It’s easy to place a whole lot of eggs in one basket in this case. Meaning downtime becomes global.
With kubernetes in the cloud, a lot of this changes. We no longer worry about hardware. In fact, we don’t have to worry too much about availability - kubernetes is supposed to solve that for us (even if challenges remain). That means we can focus more or less solely on the software side of the stack. For the developers, they gain control of their database config that they can deploy and control in the same manner as the other components.
Let’s take a look at the software that exists, and then walk throgh the installation and usage for running PostgreSQL in Google Kubernetes Engine.
Patroni is a template for high availability PostgreSQL. What Patroni does is set up a small process that manages a PostgreSQL instance - using a distributed config store to keep track of which instance is the primary instance in a cluster of multiple nodes. It does not manage that DCS by itself. DCS (etcd) is also precisely what kubernetes uses in the control plane to keep track of it’s own resources. This is perfect for a tool like patroni.
Patroni provides automated failover in the event the master pod should fail. Coupled with kubernetes - it also makes sure there are always (a configured) amount of standby nodes. Using kubernetes features, you can achieve very good HA provided your nodes are well distributed
It is written by an online fashion company called Zalando (that many of us will know from the user side) - highlighting the fact that almost anyone is an IT company these days. Zalando has put a lot of great effort into many of the tools we will use in this article.
But, returning to topic: To run PostgreSQL in kubernetes, we also need container images.
Spilo is a docker image also made by zalando, that packages major (9.x, 10 and 11) versions of PostgreSQL, Patroni, and WAL-G to use not only in kubernetes, buth also in other products like openshift or good old docker compose. A great deal of logic is placed in this image - controlled by environment variables.
We usually like to run PostgreSQL backups using barman or pgbackrest, but in the spilo image WAL-G is included. The reason is that it integrates really nicely with S3 or GCS, allowing backup functionality directly towards a bucket in the cloud. We will therefore use it to store backups in GCS.
prometheus and grafana
We’ve become fans of prometheus and grafana for metrics. While outside of the scope of this article, it is very possible to use for kubernetes monitoring generally - and we’d like to integrate the setup in this article. This article gives a nice overview to get started.
OK, almost there. There is one last component we need to gather all the tools above and make stuff really useful. Kubernetes has a nice way to abstract away tedious details from an implentation: Custom Resource Definitions, or CRDs.
The postgres operator is described in a short fashion at this link.
The basic idea is to deploy a pod that watches one or more namespaces, and allow a CRD to be deployed to form a PostgreSQL cluster statefulset with a useful number of pods to achieve HA. These pods will be running patroni to control a PostgreSQL instance, and offer WAL-G as a backup and recovery tool using GCS (or S3).
OK, enough talk - let’s do some work. We’ll keep things simple in this walkthrough - we’ll only work in the default namespace.
Install google cloud utils
We need the Google cloud SDK installed. It is outside our scope, but is well documented at this link. We recommend using native packaging on linux.
Make sure you have the following commands available before continuing:
$ gcloud $ gsutil $ kubectl
First of all, login to google cloud:
$ gcloud auth login
This will pull up a browser window and ask for permissions. Don’t worry, we’ll be back in the console soon.
Next, we can create a project to work on (If you want, you can of course work on an existing project):
$ gcloud projects create pgdemo-rl Create in progress for [https://cloudresourcemanager.googleapis.com/v1/projects/pgdemo-rl]. Waiting for [operations/cp.5807530738051270639] to finish...done. Enabling service [cloudapis.googleapis.com] on project [pgdemo-rl]... Operation "operations/acf.102a7260-30eb-48c9-99f7-7fea8fb68940" finished successfully.
Then set this to our current project:
$ gcloud config set project pgdemo-rl Updated property [core/project]
You will likely also have to enable billing:
$ gcloud alpha billing projects link pgdemo-rl --billing-account <ID> billingAccountName: billingAccounts/<ID> billingEnabled: true name: projects/pgdemo-rl/billingInfo projectId: pgdemo-rl
You can list billing accounts using:
$ gcloud alpha billing accounts list
Also, we must enable the kubernetes API for the project:
$ gcloud services enable container.googleapis.com --project pgdemo-rl
This will take a minute or two. Grab a coffee, peek out the window and rest your eyes for a little while.
Create the kubernetes cluster
A single command will fire up a GKE cluster for us. We stick to defaults, which will create a three node cluster with standard nodes. The control plane is hidden for us, and managed by google - but we will have access to it from the three nodes. Note that we also need to explicitly state that we want read/write access to storage - this is to enable writing to GCS buckets.
We live in Scandinavia, so europe-north1-a is a fine place for the nodes to live.
$ gcloud container clusters create pgcluster --zone=europe-north1-a --scopes=storage-rw --machine-type n1-standard-2 --billing-project pgdemo-rl <WARNINGs suppressed> Creating cluster pgcluster in europe-north1-a... Cluster is being configured.. .⠧ Creating cluster pgcluster in europe-north1-a... Cluster is being health-check ed (master is healthy)...done. Created [https://container.googleapis.com/v1/projects/pgdemo-rl/zones/europe-north1-a/clusters/pgcluster]. To inspect the contents of your cluster, go to: https://console.cloud.google.com/kubernetes/workload_/gcloud/europe-north1-a/pgcluster?project=pgdemo-rl kubeconfig entry generated for pgcluster. NAME LOCATION MASTER_VERSION MASTER_IP MACHINE_TYPE NODE_VERSION NUM_NODES STATUS pgcluster europe-north1-a 1.13.7-gke.8 126.96.36.199 n1-standard-2 1.13.7-gke.8 3 RUNNING
Now we need to generate a kubeconfig:
$ gcloud container clusters get-credentials pgcluster --zone europe-north1-a --project pgdemo-rl Fetching cluster endpoint and auth data. kubeconfig entry generated for pgcluster. $ kubectl get nodes NAME STATUS ROLES AGE VERSION gke-pgcluster-default-pool-6d807f36-4j7s Ready <none> 43s v1.13.7-gke.8 gke-pgcluster-default-pool-6d807f36-g2g4 Ready <none> 44s v1.13.7-gke.8 gke-pgcluster-default-pool-6d807f36-xptn Ready <none> 44s v1.13.7-gke.8
Allright, we have contact. Next up is the GCS bucket for backups.
Create a GCS bucket
GCS buckets are global, so we create like this:
$ gsutil mb -b on -l europe-north1 gs://pgbucket-rl Creating gs://pgbucket-rl/...
You may want to look at storage options that lower the cost for backups, given they will primarily see writes, and seldom reads.
When you have a named bucket, you are ready to move on.
Setting up the postgres-operator
First of all, clone the postgres operator repo. We’ll work from the master branch.
$ git clone https://github.com/zalando/postgres-operator Cloning into 'postgres-operator'... remote: Enumerating objects: 133, done. remote: Counting objects: 100% (133/133), done. remote: Compressing objects: 100% (90/90), done. remote: Total 9943 (delta 35), reused 86 (delta 30), pack-reused 9810 Receiving objects: 100% (9943/9943), 3.60 MiB | 5.80 MiB/s, done. Resolving deltas: 100% (6634/6634), done. # Descend into code $ cd postgres-operator
The postgres operator from Zalando can be configured in two ways - using a classical configmap, or using a CRD configuration object. The former is deprecated, and the latter is more powerful. We opt for the CRD configuration.
To install the operator using the CRD config, we first need to apply a service account definition:
$ kubectl create -f manifests/operator-service-account-rbac.yaml serviceaccount/zalando-postgres-operator created clusterrole.rbac.authorization.k8s.io/zalando-postgres-operator created clusterrolebinding.rbac.authorization.k8s.io/zalando-postgres-operator created
This creates a serviceaccount, and gives it a range of permissions to operate on the coming definitions.
Next, we face one of the missing features in postgres-operator - it cannot yet configure backups using it’s own configuration. It is, however, possible to configure WAL-G in the spilo image directly. To do so, we create a separate configmap that we will refer to later. Let’s create a file manifests/postgres-pod-config.yaml, with the following content:
apiVersion: v1 kind: ConfigMap metadata: name: postgres-pod-config namespace: default data: # Enable to turn backups on. USE_WALG_BACKUP: "true" USE_WALG_RESTORE: "true" WALG_GS_PREFIX: gs://pgbucket-rl/spilo/$(SCOPE) CLONE_WALG_GS_PREFIX: gs://pgbucket-rl/spilo/$(CLONE_SCOPE) WALE_BACKUP_THRESHOLD_PERCENTAGE: "100"
This defines the variables WAL-G needs to start backing up database and transaction logs to GCS. The variables SCOPE and CLONE_SCOPE will be defined in the containers where appropriate below (by spilo).
In addition, the last variable in effect tells spilo to always use pg_basebackup from the master node to create new replica pods.
We must apply this manifest to make it available:
$ kubectl create -f manifests/postgres-pod-config.yaml configmap/postgres-pod-config created
Completing the CRD config
In order to use the CRD config, we must change a value in the postgres-operator itself. Change the last few lines in manifests/postgres-operator.yaml so they read:
env: # provided additional ENV vars can overwrite individual config map entries #- name: CONFIG_MAP_NAME # value: "postgres-operator" # In order to use the CRD OperatorConfiguration instead, uncomment these lines and comment out the two lines above - name: POSTGRES_OPERATOR_CONFIGURATION_OBJECT value: postgresql-operator-default-configuration
To complete the configuration, we must also adjust and create the actual config object referenced. This is placed in manifests/postgresql-operator-default-configuration.yaml. These are the values we must set (rest of file excluded, please find relevant lines):
configuration: docker_image: krorvik/spilo:walgfix kubernetes: enable_pod_antiaffinity: true pod_environment_configmap: postgres-pod-config pod_service_account_name: zalando-postgres-operator
First, we tell postgres-operator to use antiaffinity for pods - in practice this means that pods in the same cluster will refuse to live on the same node. It is possible to use other mechanisms, zone for instance.
Next, we change the docker image to a version with the newest version of wal-g installed. Otherwise we will run into a bug later, related to restores. We expect this to be temporary. Once spilo uses version v0.2.13 or newer of wal-g, this should not be needed.
Third, we give a pointer to the extra configmap we created above to enable backups. Otherwise, nothing will happen backupwise.
Finally, the service account name given in that file does not match that given by the operator service account definition we gave first, so we adjust that.
Let’s create the operator and it’s configuration. We take a short break between, else the second may fail. If it does, grab a second (?) coffee before retrying:
$ kubectl create -f manifests/postgres-operator.yaml deployment.apps/postgres-operator created $ sleep 20s $ kubectl create -f postgresql-operator-default-configuration.yaml operatorconfiguration.acid.zalan.do/postgresql-operator-default-configuration created
Wait a few seconds, and check the operator is up and running:
$ kubectl get pods NAME READY STATUS RESTARTS AGE postgres-operator-599fd68d95-6k2rp 1/1 Running 0 22s
In addition - if you want other users of your cluster to be able to list and create clusters, you need to create a separate set of authorizations:
$ kubectl create -f manifests/user-facing-clusterroles.yaml clusterrole.rbac.authorization.k8s.io/zalando-postgres-operator:users:admin created clusterrole.rbac.authorization.k8s.io/zalando-postgres-operator:users:edit created clusterrole.rbac.authorization.k8s.io/zalando-postgres-operator:users:view created
This last one is optional, and not needed for this article. We choose to show it since it is needed in most other cases.
All good. We’ve done a little bit of work so far, but at this point, things become interesting!
Launch a cluster
Clusters are defined by simple yaml manifests. Let’s start with the minimum we need for real world usage:
apiVersion: "acid.zalan.do/v1" kind: postgresql metadata: name: acid-minimal-cluster namespace: default spec: teamId: "ACID" volume: size: 1Gi numberOfInstances: 2 users: zalando: # database owner - superuser - createdb foo_user:  # role for application foo databases: foo: foo_user # dbname: owner postgresql: version: "11"
This manifest defines a cluster named “acid-minimal-cluster”, owned by a team “ACID” (this team and the first part of the name must match). It will claim a persistent volume of 1GiB, with two instances (translating to a primary and a replica). Further, some users are defined, and a database. Finally, we choose to run with major version 11. We don’t want to use old stuff ;)
Note that another user will be created in addition, called “standby”. This is the user with replication privileges that will perform streaming replication to the replica server.
All users will receive generated passwords that will be placed in socalled kubernetes secrets. That means no passwords exposed in code. Nice!
Let’s create this cluster:
$ kubectl create -f manifests/minimal-postgres-manifest.yaml postgresql.acid.zalan.do/acid-minimal-cluster created
After a little while, things should look like this:
$ kubectl get pods NAME READY STATUS RESTARTS AGE acid-minimal-cluster-0 1/1 Running 0 37s acid-minimal-cluster-1 0/1 ContainerCreating 0 10s postgres-operator-599fd68d95-6k2rp 1/1 Running 0 31m
As we can see, the primary pod is already up, and the second is creating. It should soon be running. Let’s take a look at what is happening in the first one:
$ kubectl exec -it acid-minimal-cluster-0 su postgres postgres@acid-minimal-cluster-0:~$ psql -l List of databases Name | Owner | Encoding | Collate | Ctype | Access privileges -----------+----------+----------+-------------+-------------+----------------------- foo | zalando | UTF8 | en_US.UTF-8 | en_US.UTF-8 | postgres | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | template0 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres + | | | | | postgres=CTc/postgres template1 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres + | | | | | postgres=CTc/postgres (4 rows) postgres@acid-minimal-cluster-0:~$ patronictl list +----------------------+------------------------+-----------+--------+---------+----+-----------+ | Cluster | Member | Host | Role | State | TL | Lag in MB | +----------------------+------------------------+-----------+--------+---------+----+-----------+ | acid-minimal-cluster | acid-minimal-cluster-0 | 10.12.2.5 | Leader | running | 1 | 0 | | acid-minimal-cluster | acid-minimal-cluster-1 | 10.12.2.6 | | running | 1 | 0 | +----------------------+------------------------+-----------+--------+---------+----+-----------+
Cool! One command to create a replicating PostgreSQL 11 cluster. And, in your GCS browser, you should now see a folder /spilo/acid-minimal-cluster, with subfolder for wal and basebackup.
Resize the cluster
Resizing the cluster now is easy - we change the numberOfInstances key in the cluster manifest:
$ git diff -U0 minimal-postgres-manifest.yaml <SNIP> @@ -10 +10 @@ spec: - numberOfInstances: 2 + numberOfInstances: 3
Then apply the manifest:
$ kubectl apply -f minimal-postgres-manifest.yaml postgresql.acid.zalan.do/acid-minimal-cluster configured $ kubectl get pods NAME READY STATUS RESTARTS AGE acid-minimal-cluster-0 1/1 Running 0 11m acid-minimal-cluster-1 1/1 Running 0 10m acid-minimal-cluster-2 1/1 Running 0 30s postgres-operator-599fd68d95-2q8t6 1/1 Running 0 11m $ kubectl exec -it acid-minimal-cluster-0 su postgres postgres@acid-minimal-cluster-0:~$ patronictl list +----------------------+------------------------+-----------+--------+---------+----+-----------+ | Cluster | Member | Host | Role | State | TL | Lag in MB | +----------------------+------------------------+-----------+--------+---------+----+-----------+ | acid-minimal-cluster | acid-minimal-cluster-0 | 10.12.2.7 | Leader | running | 1 | 0 | | acid-minimal-cluster | acid-minimal-cluster-1 | 10.12.2.8 | | running | 1 | 0 | | acid-minimal-cluster | acid-minimal-cluster-2 | 10.12.1.9 | | running | 1 | 0 | +----------------------+------------------------+-----------+--------+---------+----+-----------+
All good! Patroni will fire up another pod, and run pg_basebackup with the right options to create a new replica.
In GCS, we cannot resize storage without pod downtime. Still, the process is not too complex. First, edit the PVC for the pod - for instance:
$ kubectl edit pvc pgdata-acid-minimal-cluster-2
Find the spec.resources.requests.storage key, and set it to the desired value. Save and exit. Then delete the corresponding pod:
$ kubectl delete pod acid-minimal-cluster-2 pod "acid-minimal-cluster-2" deleted
To minimize downtime for your application, we suggest this is run for the replicas first (shown by patronictl list on any pod), and the primary last. This will give the application only one disconnect. At this time, you may argue that it would in fact be possible to do this with zero downtime on a physical cluster with LVM. Well, yes. But then again, your developers likely would not have the opportunity to do that.
Deleting the cluster
We can tear down the cluster by deleting the postgresql-type:
$ kubectl delete postgresql acid-minimal-cluster
All associated resources should be queued for deletion. Keep in mind we will still have the backup in GCS.
Monitoring using a sidecar
postgres-operator allows us to attach an arbitrary number of sidecars. One of the main reasons is monitoring. A monitoring sidecar will have access to the data directory, and may of course also talk to main container using TCP. It does not give access to the Unix socket postgres uses. This is not a problem, TCP communications is already simple. We will show how to set up a prometheus postgres exporter sidecar for monitoring.
The sidecar is set up using the cluster manifest, and we will be using a public postgres_exporter image. Lets see:
apiVersion: "acid.zalan.do/v1" kind: postgresql metadata: name: acid-monitoring-cluster namespace: default spec: teamId: "ACID" volume: size: 1Gi numberOfInstances: 1 users: zalando: # database owner - superuser - createdb prometheus:  databases: foo: zalando # dbname: owner postgresql: version: "11" sidecars: - name: "postgres-exporter" image: "wrouesnel/postgres_exporter:latest" env: - name: "MONPASS" valueFrom: secretKeyRef: name: prometheus.acid-monitoring-cluster.credentials.postgresql.acid.zalan.do key: password - name: "DATA_SOURCE_NAME" value: "postgresql://prometheus:$(MONPASS)@$(POD_NAME):5432/postgres?sslmode=require"
We will be creating a new cluster, with one node only, and monitoring set up.
The image part is self explanatory. We only need to pass in a single environment variable to make it work, and that is DATA_SOURCE_NAME - which defines with what credentials and to what resource we want to connect. We are getting the password for user prometheus into the env var MONPASS via a secretKeyRef (meaning; get it from the kubernets secret), and make sure we are refering to the correct secret. The POD_NAME is the internal hostname for the pod, and we are using ssl since pg_hba for spilo sets up an hostssl line in pg_hba.conf.
We can now apply this manifest and see what happens:
$ kubectl apply -f monitoring-postgres-manifest.yaml postgresql.acid.zalan.do/acid-monitoring-cluster created $ sleep 10s $ kubectl get pods NAME READY STATUS RESTARTS AGE acid-monitoring-cluster-0 2/2 Running 0 4m28s postgres-operator-599fd68d95-2q8t6 1/1 Running 0 121m
Note that each cluster pod now has 2 containers. The other one is the prometheus container. Let’s connect to the cluster container, and see if we can get metrics:
postgres@acid-monitoring-cluster-0:~$ curl -s http://localhost:9187/metrics | grep pg_up # HELP pg_up Whether the last scrape of metrics from PostgreSQL was able to connect to the server (1 for yes, 0 for no). # TYPE pg_up gauge pg_up 1
Very well, we now have a cluster we can scrape metrics from. Setting up a telegraf sidecar for reporting to influxDB would follow the same pattern, using environment variables to tell telegraf how to connect - and which influxDB to report to. Let’s tear the cluster down before continuing:
$ kubectl delete postgresql acid-monitoring-cluster postgresql.acid.zalan.do "acid-monitoring-cluster" deleted
At this point, there should be no clusters left - just the backups in GCS.
Now, imagine we lost our entire kubernetes cluster. How do we get our data back? First of all, we need to set up the operator again:
apiVersion: "acid.zalan.do/v1" kind: postgresql metadata: name: acid-minimal-cluster namespace: default spec: teamId: "ACID" volume: size: 1Gi numberOfInstances: 2 postgresql: version: "11" clone: name: acid-minimal-cluster timestamp: "2019-09-23T17:10:00+02:00"
You can put the timestamp in the future to recover all the way to the last entry in the transaction log.
Then we can create it using:
$ kubectl create -f minimal-restore-manifest.yaml postgresql.acid.zalan.do/acid-minimal-cluster created $ kubectl get pods NAME READY STATUS RESTARTS AGE acid-minimal-cluster-0 1/1 Running 0 37s acid-minimal-cluster-1 1/1 Running 0 23s postgres-operator-599fd68d95-2q8t6 1/1 Running 0 3h51m
We should then be able to connect and find our data restored.
Also note that this method can be used to duplicate running clusters.
This is as simple as changing the image definition in the CRD configuration, and applying the configuration. Follow up by deleting the postgres-operator-
This will trigger a rolling update of the pods - standbys first, masters last.
$ kubectl apply -f manifests/postgresql-operator-default-configuration.yaml #Changed $ kubectl delete pod postgres-operator-599fd68d95-2q8t6
Be careful though, this will update all clusters the operator is watching at the same time. One way to minimize issues around this is to have separate operators per namespace.
This one is a bit more complex - the operator does not support doing it directly. However, the spilo image can detect that a clone is performed from an older major release, and update the cluster files before starting. Hence, if you perform a clone to a newer major version, you should be good to go. This is the exact same procedure as the disaster recovery scenario, but with the major version changed.
Some notes on kubernetes and memory limits
One problem with PostgreSQL in kubernetes is sizing pods for your workload. PostgreSQL uses memory quite actively, and as such, there are several things to be aware of:
- shared_buffers takes out a fixed amount of memory
- Each connection takes out a small slice of memory
- Every query requires memory to create and deliver result sets
- Builtin utility and background processes need memory
- There’s no swap in kubernetes…
It’s almost never a good idea to overdo the static parts (shared_buffers), and forget the memory other processes need. First of all, PostgreSQL relies heavily on memory being available. Second, if you hit the hard limit for a pod, kubernetes will kill it. No questions asked.
So how do we avoid that?
Spilo will default shared_buffers to 20% of the memory limit unless we tell it otherwise. If no limit is explicitly given, the default config in postgres-operator says to set 1G as memory limit. This is done so that there is memory available for the bullet points above. We recommend you do not override the shared_buffers setting.
work_mem is set at 4MB by default, and we also recommend you leave it - unless you get heavy usage of temp files (which, of course, you are monitoring). But beware that this can happen with single queries running often - can you improve the query? In many cases, you can.
TL;DR: For development or smaller clusters, the defaults should get you a long way. For larger clusters, we need more experience, but keeping shared_buffers moderate is likely a good candidate. The effects of downsizing shared_buffers in kubernetes are likely to increase the io load, but will increase the memory available for queries and background jobs.
There is more than enough quality in the solutions mapped here to use this setup in production. We have customers in early production, and others now moving to the testing phase. We’ve seen the dreaded OOM killer a few times, but have been able to give the pods enough resources without going too far. Even so - in kubernetes your application should be able to survive minor downtimes if pods die. That includes the database connection. Designing for it is a good idea.
The postgres-operator provides a very nice way to allow development teams to take control of the databases, and to manage them in the same way as they do for the rest of the application or service - including disk and memory allocations. Kubernetes secrets keep secrets from spilling out into version control.
For larger monolithic databases and applications that are not on kubernetes, you will probably want to stay on servers a bit longer.
If you are already on Kubernetes, and have smaller databases (not unlikely for microservices for instance), the postgres-operator should provide operations with less headaches, and the developers with more control (and of course, more responsibility).