Several Redpill Linpro customers have now adopted the Kubernetes way of delivery. Kubernetes has changed the way they work, and is acting as an effective catalyst empowering their developers. For these customers, the old-school way of running PostgreSQL is becoming cumbersome:
The typical PostgreSQL installation has been based on bare metal or, for the past few years, virtual machines. These are often set up as streaming replication clusters, with a primary read/write instance and one or more replicating read-only secondaries. Additionally, a lot of effort goes into making sure the installation is highly available, and that backups are taken to separate hardware.
Speaking of backups - you should test restoring them regularly. Imagine doing that across many clusters, even when they run on virtual machines.
If you need to manage many clusters, this has a very noticeable effect on the time you spend keeping them in the desired state - up, available, healthy and backed up. And then you run out of disk. It’s easy to place a whole lot of eggs in one basket in this case, meaning downtime becomes global.
With Kubernetes in the cloud, a lot of this changes. We no longer worry about hardware. In fact, we don’t have to worry too much about availability either - Kubernetes is supposed to solve that for us (even if challenges remain). That means we can focus more or less solely on the software side of the stack. Developers gain control of their database configuration, and can deploy and manage it in the same manner as the other components.
Let’s take a look at the software that exists, and then walk through the installation and usage for running PostgreSQL in Google Kubernetes Engine.
Patroni

Patroni is a template for highly available PostgreSQL. Patroni sets up a small process that manages a PostgreSQL instance, using a distributed configuration store (DCS) to keep track of which instance is the primary in a cluster of multiple nodes. Patroni does not manage the DCS itself. A DCS (etcd) is also precisely what Kubernetes uses in its control plane to keep track of its own resources - a perfect fit for a tool like Patroni.
Patroni provides automated failover in the event the primary pod should fail. Coupled with Kubernetes, it also makes sure there is always a configured number of standby nodes. Using Kubernetes features, you can achieve very good HA, provided your nodes are well distributed.
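To make this concrete, here is a hedged sketch of the kind of YAML configuration a Patroni node runs from. All names and addresses below are made-up illustration values - and in our setup, spilo generates this file from environment variables, so we never write it by hand:

```yaml
# Sketch of a Patroni member configuration - illustrative values only
scope: demo-cluster            # cluster name, shared by all members
name: demo-node-1              # unique name for this member
restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.0.1:8008
kubernetes:                    # use the Kubernetes API as the DCS
  labels:
    application: patroni
  use_endpoints: true
postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.0.1:5432
  data_dir: /home/postgres/pgdata
```

The interesting part is the kubernetes section: instead of pointing Patroni at a standalone etcd, the leader key and member state are kept in Kubernetes objects.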
Patroni is written by an online fashion company called Zalando (that many of us will know from the user side) - highlighting the fact that almost any company is an IT company these days. Zalando has put a lot of great effort into many of the tools we will use in this article.
But, returning to topic: to run PostgreSQL in Kubernetes, we also need container images.
Spilo

Spilo is a Docker image, also made by Zalando, that packages major versions (9.x, 10 and 11) of PostgreSQL, Patroni, and WAL-E for use not only in Kubernetes, but also in other products like OpenShift or good old docker-compose. A great deal of logic is placed in this image, controlled by environment variables.
We usually like to run PostgreSQL backups using barman or pgbackrest, but the spilo image ships WAL-E. The reason is that WAL-E integrates really nicely with S3 or GCS, allowing backups directly to a bucket in the cloud. We will therefore use it to store backups in GCS.
Prometheus and Grafana
We’ve become fans of Prometheus and Grafana for metrics. While outside the scope of this article, they are very usable for Kubernetes monitoring in general - and the setup in this article integrates well with them. This article gives a nice overview to get started.
OK, almost there. There is one last component we need to gather all the tools above and make things really useful. Kubernetes has a nice way to abstract tedious details away from an implementation: Custom Resource Definitions, or CRDs.
The postgres-operator

The postgres-operator is described briefly at this link.
The basic idea is to deploy a pod that watches one or more namespaces, allowing a CRD to be deployed to form a PostgreSQL cluster statefulset, with enough pods to achieve HA. These pods run Patroni to control a PostgreSQL instance, and offer WAL-E as a backup and recovery tool using GCS (or S3).
OK, enough talk - let’s do some work. We’ll keep things simple in this walkthrough - we’ll only work in the default namespace.
Install Google Cloud utils
We need the Google Cloud SDK installed. That is outside our scope, but it is well documented at this link. We recommend using native packaging on Linux.
Make sure you have the following commands available before continuing:
$ gcloud
$ gsutil
$ kubectl
First of all, log in to Google Cloud:
$ gcloud auth login
This will pull up a browser window and ask for permissions. Don’t worry, we’ll be back in the console soon.
Next, we can create a project to work on (If you want, you can of course work on an existing project):
$ gcloud projects create pgdemo-rl
Create in progress for [https://cloudresourcemanager.googleapis.com/v1/projects/pgdemo-rl].
Waiting for [operations/cp.5807530738051270639] to finish...done.
Enabling service [cloudapis.googleapis.com] on project [pgdemo-rl]...
Operation "operations/acf.102a7260-30eb-48c9-99f7-7fea8fb68940" finished successfully.
Then set this to our current project:
$ gcloud config set project pgdemo-rl
Updated property [core/project]
You will likely also have to enable billing:
$ gcloud alpha billing projects link pgdemo-rl --billing-account <ID>
billingAccountName: billingAccounts/<ID>
billingEnabled: true
name: projects/pgdemo-rl/billingInfo
projectId: pgdemo-rl
You can list billing accounts using:
$ gcloud alpha billing accounts list
Also, we must enable the kubernetes API for the project:
$ gcloud services enable container.googleapis.com --project pgdemo-rl
This will take a minute or two. Grab a coffee, peek out the window and rest your eyes for a little while.
Create the Kubernetes cluster
A single command will fire up a GKE cluster for us. We stick to defaults, which will create a three-node cluster with standard nodes. The control plane is hidden from us and managed by Google - but we will have access to it from the three nodes. Note that we also need to explicitly request read/write access to storage - this is needed for writing to GCS buckets.
We live in Scandinavia, so europe-north1-a is a fine place for the nodes to live too.
$ gcloud container clusters create pgcluster --zone=europe-north1-a --scopes=storage-rw --billing-project pgdemo-rl
<WARNINGs suppressed>
Creating cluster pgcluster in europe-north1-a... Cluster is being configured...
Creating cluster pgcluster in europe-north1-a... Cluster is being health-checked (master is healthy)...done.
Created [https://container.googleapis.com/v1/projects/pgdemo-rl/zones/europe-north1-a/clusters/pgcluster].
To inspect the contents of your cluster, go to: https://console.cloud.google.com/kubernetes/workload_/gcloud/europe-north1-a/pgcluster?project=pgdemo-rl
kubeconfig entry generated for pgcluster.
NAME       LOCATION         MASTER_VERSION  MASTER_IP       MACHINE_TYPE   NODE_VERSION  NUM_NODES  STATUS
pgcluster  europe-north1-a  1.13.7-gke.8    184.108.40.206  n1-standard-1  1.13.7-gke.8  3          RUNNING
Now we need to generate a kubeconfig:
$ gcloud container clusters get-credentials pgcluster --zone europe-north1-a --project pgdemo-rl
Fetching cluster endpoint and auth data.
kubeconfig entry generated for pgcluster.
$ kubectl get nodes
NAME                                      STATUS  ROLES   AGE  VERSION
gke-pgcluster-default-pool-6d807f36-4j7s  Ready   <none>  43s  v1.13.7-gke.8
gke-pgcluster-default-pool-6d807f36-g2g4  Ready   <none>  44s  v1.13.7-gke.8
gke-pgcluster-default-pool-6d807f36-xptn  Ready   <none>  44s  v1.13.7-gke.8
Alright, we have contact. Next up is the GCS bucket for backups.
Create a GCS bucket
GCS bucket names are global, so we create ours like this:
$ gsutil mb -b on -l europe-north1 gs://pgbucket-rl
Creating gs://pgbucket-rl/...
You may want to look at storage class options that lower the cost for backups, given that they will primarily see writes, and seldom reads.
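As a sketch of what that could look like: switch the bucket's default storage class to Nearline and add a lifecycle rule expiring old objects. The 30-day expiry is an arbitrary example value - align it with your retention requirements - and the gsutil calls are commented out since they require an authenticated gcloud session:

```shell
# Lifecycle policy that deletes objects older than 30 days
# (arbitrary example value - make sure it exceeds your backup retention!)
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {"action": {"type": "Delete"}, "condition": {"age": 30}}
  ]
}
EOF
# These require an authenticated gcloud environment:
# gsutil defstorageclass set NEARLINE gs://pgbucket-rl
# gsutil lifecycle set lifecycle.json gs://pgbucket-rl
```

Nearline storage is priced for data that is written often but read rarely - a reasonable match for WAL archives and base backups.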
When you have a named bucket, you are ready to move on.
Setting up the postgres-operator
First of all, clone the postgres operator repo. We’ll work from the master branch.
$ git clone https://github.com/zalando/postgres-operator
Cloning into 'postgres-operator'...
remote: Enumerating objects: 133, done.
remote: Counting objects: 100% (133/133), done.
remote: Compressing objects: 100% (90/90), done.
remote: Total 9943 (delta 35), reused 86 (delta 30), pack-reused 9810
Receiving objects: 100% (9943/9943), 3.60 MiB | 5.80 MiB/s, done.
Resolving deltas: 100% (6634/6634), done.

# Descend into code
$ cd postgres-operator
The postgres operator from Zalando can be configured in two ways - using a classical configmap, or using a CRD configuration object. The former is deprecated, and the latter is more powerful. We opt for the CRD configuration.
To install the operator using the CRD config, we first need to apply a service account definition:
$ kubectl create -f manifests/operator-service-account-rbac.yaml
serviceaccount/zalando-postgres-operator created
clusterrole.rbac.authorization.k8s.io/zalando-postgres-operator created
clusterrolebinding.rbac.authorization.k8s.io/zalando-postgres-operator created
This creates a serviceaccount, and gives it a range of permissions to operate on the coming definitions.
Next, we face one of the missing features in postgres-operator - it cannot yet configure backups through its own configuration. It is, however, possible to configure WAL-E in the spilo image directly. To do so, we create a separate configmap that we will refer to later. Let’s create a file manifests/postgres-pod-config.yaml with the following content:
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-pod-config
  namespace: default
data:
  WALE_GS_PREFIX: gs://pgbucket-rl/spilo/$(SCOPE)
  WALE_CLONE_SCOPE: gs://pgbucket-rl/spilo/$(CLONE_SCOPE)
This defines the two variables WAL-E needs to start backing up the database and transaction logs to GCS. Note that the paths are not random; they match spilo’s defaults using our own naming. The variables SCOPE and CLONE_SCOPE will be defined in the containers where appropriate below.
We must apply this manifest to make it available:
$ kubectl create -f manifests/postgres-pod-config.yaml
configmap/postgres-pod-config created
Completing the CRD config
In order to use the CRD config, we must change a value in the postgres-operator itself. Change the last few lines in manifests/postgres-operator.yaml so they read:
        env:
        # provided additional ENV vars can overwrite individual config map entries
        #- name: CONFIG_MAP_NAME
        #  value: "postgres-operator"
        # In order to use the CRD OperatorConfiguration instead, uncomment these
        # lines and comment out the two lines above
        - name: POSTGRES_OPERATOR_CONFIGURATION_OBJECT
          value: postgresql-operator-default-configuration
In order to complete the configuration, we must also adjust and create the actual config object referenced. This is placed in manifests/postgresql-operator-default-configuration.yaml. Let’s take a look at the diff we must apply:
$ git diff -U0 manifests/postgresql-operator-default-configuration.yaml
<SNIP>
@@ -7 +7 @@ configuration:
-  docker_image: registry.opensource.zalan.do/acid/spilo-11:1.5-p9
+  docker_image: krorvik/spilo:finn-5
@@ -9,2 +9,2 @@ configuration:
-  max_instances: -1
-  min_instances: -1
+  max_instances: 5
+  min_instances: 1
@@ -35 +35 @@ configuration:
-  # pod_environment_configmap: ""
+  pod_environment_configmap: "postgres-pod-config"
@@ -38 +38 @@ configuration:
-  pod_service_account_name: operator
+  pod_service_account_name: zalando-postgres-operator
First of all, there is a bug in WAL-E that makes it return naive datetime objects (in Python) when querying GCS for object information, which effectively prevents restores. The bug only manifests with GCS. The given image krorvik/spilo:finn-5 works around it. This is NOT a recommended approach for production, but will suffice for our needs.
Next, we adjust the min and max number of instances. Setting the minimum is particularly important - it prevents you from taking down all pods if, for instance, you do like us and indent the wrong lines in the cluster definition (which we’ll see soon). Zap - lesson learned.
Third, we give a pointer to the extra configmap we created above to enable backups. Otherwise, nothing will happen backupwise.
Finally, the service account name given in that file does not match that given by the operator service account definition we gave first, so we adjust that.
Let’s create the operator and its configuration. We take a short break between the two commands, or the second may fail. If it does, grab a second (?) coffee before retrying:
$ kubectl create -f manifests/postgres-operator.yaml
deployment.apps/postgres-operator created
$ sleep 30s
$ kubectl create -f manifests/postgresql-operator-default-configuration.yaml
operatorconfiguration.acid.zalan.do/postgresql-operator-default-configuration created
Wait a few seconds, and check the operator is up and running:
$ kubectl get pods
NAME                                READY  STATUS   RESTARTS  AGE
postgres-operator-599fd68d95-6k2rp  1/1    Running  0         22s
In addition - if you want other users of your cluster to be able to list and create clusters, you need to create a separate set of authorizations:
$ kubectl create -f manifests/user-facing-clusterroles.yaml
clusterrole.rbac.authorization.k8s.io/zalando-postgres-operator:users:admin created
clusterrole.rbac.authorization.k8s.io/zalando-postgres-operator:users:edit created
clusterrole.rbac.authorization.k8s.io/zalando-postgres-operator:users:view created
This last one is optional, and not needed for this article. We choose to show it since it is needed in most other cases.
All good. We’ve done a little bit of work so far, but at this point, things become interesting!
Launch a cluster
Clusters are defined by simple yaml manifests. Let’s start with the minimum we need for real world usage:
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: acid-minimal-cluster
  namespace: default
spec:
  teamId: "ACID"
  volume:
    size: 1Gi
  numberOfInstances: 2
  users:
    zalando:  # database owner
    - superuser
    - createdb
    foo_user: []  # role for application foo
  databases:
    foo: foo_user  # dbname: owner
  postgresql:
    version: "11"
This manifest defines a cluster named “acid-minimal-cluster”, owned by a team “ACID” (this team and the first part of the name must match). It will claim a persistent volume of 1GiB, with two instances (translating to a primary and a replica). Further, some users are defined, and a database. Finally, we choose to run with major version 11. We don’t want to use old stuff ;)
Note that another user will be created in addition, called “standby”. This is the user with replication privileges that will perform streaming replication to the replica server.
All users will receive generated passwords, placed in so-called Kubernetes secrets. That means no passwords exposed in code. Nice!
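As an illustration, a password can be fetched from the secret and decoded. The secret name below follows the operator’s <username>.<clustername>.credentials... naming scheme and is our assumption for this example; the kubectl line is commented out since it needs a live cluster:

```shell
# Fetch the generated password for foo_user (requires a live cluster;
# secret name is assumed from the operator's naming scheme):
# kubectl get secret foo-user.acid-minimal-cluster.credentials.postgresql.acid.zalan.do \
#   -o 'jsonpath={.data.password}' | base64 -d
# The decoding step itself, demonstrated on a made-up value:
echo 'c3VwZXJzZWNyZXQ=' | base64 -d  # → supersecret
```

Your application typically never needs to do this by hand - it mounts the secret as an environment variable or file instead.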
Let’s create this cluster:
$ kubectl create -f manifests/minimal-postgres-manifest.yaml
postgresql.acid.zalan.do/acid-minimal-cluster created
After a little while, things should look like this:
$ kubectl get pods
NAME                                READY  STATUS             RESTARTS  AGE
acid-minimal-cluster-0              1/1    Running            0         37s
acid-minimal-cluster-1              0/1    ContainerCreating  0         10s
postgres-operator-599fd68d95-6k2rp  1/1    Running            0         31m
As we can see, the primary pod is already up, and the second is creating. It should soon be running. Let’s take a look at what is happening in the first one:
$ kubectl exec -it acid-minimal-cluster-0 su postgres
postgres@acid-minimal-cluster-0:~$ psql -l
                                  List of databases
   Name    |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges
-----------+----------+----------+-------------+-------------+-----------------------
 foo       | zalando  | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 postgres  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 template0 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
(4 rows)

postgres@acid-minimal-cluster-0:~$ patronictl list
+----------------------+------------------------+-----------+--------+---------+----+-----------+
|       Cluster        |         Member         |    Host   |  Role  |  State  | TL | Lag in MB |
+----------------------+------------------------+-----------+--------+---------+----+-----------+
| acid-minimal-cluster | acid-minimal-cluster-0 | 10.12.2.5 | Leader | running |  1 |         0 |
| acid-minimal-cluster | acid-minimal-cluster-1 | 10.12.2.6 |        | running |  1 |         0 |
+----------------------+------------------------+-----------+--------+---------+----+-----------+
Cool! One command to create a replicating PostgreSQL 11 cluster. And, in your GCS browser, you should now see a folder /spilo/acid-minimal-cluster, with subfolders for wal and basebackup.
Speaking of backups - there is one critical piece of information we will need later, when restoring a cluster: its uid:
$ kubectl get postgresql -o yaml acid-minimal-cluster | grep uid
    uid: 23fbac89-de13-11e9-9fe2-42010aa600ea
Save this for later - and note that it is a good idea to store these uids for disaster recovery.
Resize the cluster
Resizing the cluster now is easy - we change the numberOfInstances key in the cluster manifest:
$ git diff -U0 manifests/minimal-postgres-manifest.yaml
<SNIP>
@@ -10 +10 @@ spec:
-  numberOfInstances: 2
+  numberOfInstances: 3
Then apply the manifest:
$ kubectl apply -f manifests/minimal-postgres-manifest.yaml
postgresql.acid.zalan.do/acid-minimal-cluster configured
$ kubectl get pods
NAME                                READY  STATUS   RESTARTS  AGE
acid-minimal-cluster-0              1/1    Running  0         11m
acid-minimal-cluster-1              1/1    Running  0         10m
acid-minimal-cluster-2              1/1    Running  0         30s
postgres-operator-599fd68d95-2q8t6  1/1    Running  0         11m
$ kubectl exec -it acid-minimal-cluster-0 su postgres
postgres@acid-minimal-cluster-0:~$ patronictl list
+----------------------+------------------------+-----------+--------+---------+----+-----------+
|       Cluster        |         Member         |    Host   |  Role  |  State  | TL | Lag in MB |
+----------------------+------------------------+-----------+--------+---------+----+-----------+
| acid-minimal-cluster | acid-minimal-cluster-0 | 10.12.2.7 | Leader | running |  1 |         0 |
| acid-minimal-cluster | acid-minimal-cluster-1 | 10.12.2.8 |        | running |  1 |         0 |
| acid-minimal-cluster | acid-minimal-cluster-2 | 10.12.1.9 |        | running |  1 |         0 |
+----------------------+------------------------+-----------+--------+---------+----+-----------+
All good! Patroni will fire up another pod, and run pg_basebackup with the right options to create a new replica.
On GKE, we cannot resize the storage without pod downtime. Still, the process is not too complex. First, edit the PVC for the pod - for instance:
$ kubectl edit pvc pgdata-acid-minimal-cluster-2
Find the spec.resources.requests.storage key, and set it to the desired value. Save and exit. Then delete the corresponding pod:
$ kubectl delete pod acid-minimal-cluster-2
pod "acid-minimal-cluster-2" deleted
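As an aside, the interactive edit can be replaced by a kubectl patch call. The 2Gi target size is a hypothetical example, and the kubectl lines are commented out since they require a live cluster:

```shell
# Build the patch payload separately so it can be inspected first
PATCH='{"spec":{"resources":{"requests":{"storage":"2Gi"}}}}'
echo "$PATCH" | python3 -m json.tool > /dev/null && echo "patch ok"  # → patch ok
# Apply against a live cluster, then recycle the pod:
# kubectl patch pvc pgdata-acid-minimal-cluster-2 -p "$PATCH"
# kubectl delete pod acid-minimal-cluster-2
```

This makes the change scriptable, which helps when you need to repeat it across every PVC in the statefulset.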
To minimize downtime for your application, we suggest doing this for the replicas first (shown by patronictl list on any pod), and the primary last. This way the application sees only one disconnect. At this point, you may argue that it would in fact be possible to do this with zero downtime on a physical cluster with LVM. Well, yes. But then again, your developers likely would not have the opportunity to do that.
Deleting the cluster
We can tear down the cluster by deleting the postgresql resource:
$ kubectl delete postgresql acid-minimal-cluster
All associated resources should be queued for deletion. Keep in mind we will still have the backup in GCS.
Monitoring using a sidecar
The postgres-operator allows us to attach an arbitrary number of sidecars. One of the main reasons is monitoring. A monitoring sidecar has access to the data directory, and may of course also talk to the main container over TCP. It does not get access to the Unix socket postgres uses, but this is not a problem - TCP communication is simple enough. We will show how to set up a Prometheus postgres_exporter sidecar for monitoring.
The sidecar is set up using the cluster manifest, and we will be using a public postgres_exporter image. Let’s see:
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: acid-monitoring-cluster
  namespace: default
spec:
  teamId: "ACID"
  volume:
    size: 1Gi
  numberOfInstances: 1
  users:
    zalando:  # database owner
    - superuser
    - createdb
    prometheus: []
  databases:
    foo: zalando  # dbname: owner
  postgresql:
    version: "11"
  sidecars:
    - name: "postgres-exporter"
      image: "wrouesnel/postgres_exporter:latest"
      env:
        - name: "MONPASS"
          valueFrom:
            secretKeyRef:
              name: prometheus.acid-monitoring-cluster.credentials.postgresql.acid.zalan.do
              key: password
        - name: "DATA_SOURCE_NAME"
          value: "postgresql://prometheus:$(MONPASS)@$(POD_NAME):5432/postgres?sslmode=require"
We will be creating a new cluster, with one node only, and monitoring set up.
The image part is self-explanatory. We only need to pass in a single environment variable to make it work: DATA_SOURCE_NAME, which defines with what credentials and to what resource we want to connect. We get the password for the user prometheus into the env var MONPASS via a secretKeyRef (meaning: fetch it from the Kubernetes secret), making sure we refer to the correct secret. POD_NAME is the internal hostname for the pod, and we use SSL since spilo sets up a hostssl line in pg_hba.conf.
We can now apply this manifest and see what happens:
$ kubectl apply -f monitoring-postgres-manifest.yaml
postgresql.acid.zalan.do/acid-monitoring-cluster created
$ sleep 10s
$ kubectl get pods
NAME                                READY  STATUS   RESTARTS  AGE
acid-monitoring-cluster-0           2/2    Running  0         4m28s
postgres-operator-599fd68d95-2q8t6  1/1    Running  0         121m
Note that each cluster pod now has 2 containers. The second one is the postgres-exporter sidecar. Let’s connect to the cluster container, and see if we can get metrics:
postgres@acid-monitoring-cluster-0:~$ curl -s http://localhost:9187/metrics | grep pg_up
# HELP pg_up Whether the last scrape of metrics from PostgreSQL was able to connect to the server (1 for yes, 0 for no).
# TYPE pg_up gauge
pg_up 1
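On the Prometheus side, a minimal scrape job for this endpoint could look like the sketch below. The target address is an assumption based on the pod and service naming in our example; in a real setup you would more likely discover pods via kubernetes_sd_configs:

```yaml
scrape_configs:
  - job_name: 'postgres'
    static_configs:
      # assumed in-cluster pod DNS name; 9187 is the postgres_exporter default port
      - targets: ['acid-monitoring-cluster-0.acid-monitoring-cluster:9187']
```

With service discovery and relabeling, new cluster pods would be picked up automatically instead of being listed by hand.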
Very well, we now have a cluster we can scrape metrics from. Setting up a Telegraf sidecar for reporting to InfluxDB would follow the same pattern, using environment variables to tell Telegraf how to connect - and which InfluxDB to report to. Let’s tear the cluster down before continuing:
$ kubectl delete postgresql acid-monitoring-cluster
postgresql.acid.zalan.do "acid-monitoring-cluster" deleted
At this point, there should be no clusters left - just the backups in GCS.
Disaster recovery

Now, imagine we lost our entire Kubernetes cluster. How do we get our data back? First of all, we need to set up the operator again. Then, using the uid we found earlier and the GCS backups that we still have (right?), we can recreate the cluster with this yaml:
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: acid-minimal-cluster
  namespace: default
spec:
  teamId: "ACID"
  volume:
    size: 1Gi
  numberOfInstances: 2
  postgresql:
    version: "11"
  clone:
    uid: 23fbac89-de13-11e9-9fe2-42010aa600ea
    name: acid-minimal-cluster
    timestamp: "2019-09-23T17:10:00+02:00"
You can put the timestamp in the future to recover all the way to the last entry in the transaction log.
Then we can create it using:
$ kubectl create -f minimal-restore-manifest.yaml
postgresql.acid.zalan.do/acid-minimal-cluster created
$ kubectl get pods
NAME                                READY  STATUS   RESTARTS  AGE
acid-minimal-cluster-0              1/1    Running  0         37s
acid-minimal-cluster-1              1/1    Running  0         23s
postgres-operator-599fd68d95-2q8t6  1/1    Running  0         3h51m
We should then be able to connect and find our data restored.
Upgrading the spilo image

Upgrading to a new spilo image is as simple as changing the image definition in the CRD configuration, and applying the configuration. Follow up by deleting the postgres-operator pod:
This will trigger a rolling update of the pods - standbys first, masters last.
$ kubectl apply -f manifests/postgresql-operator-default-configuration.yaml  # Changed
$ kubectl delete pod postgres-operator-599fd68d95-2q8t6
Be careful though, this will update all clusters the operator is watching at the same time.
Major version upgrades

Major version upgrades are a bit more complex - the operator does not support them directly. However, the spilo image can detect that a clone is performed from an older major release, and upgrade the cluster files before starting. Hence, if you perform a clone to a newer major version, you should be good to go. This is exactly the same procedure as the disaster recovery scenario, but with the major version changed.
Some notes on kubernetes and memory limits
One problem with PostgreSQL in kubernetes is sizing pods for your workload. PostgreSQL uses memory quite actively, and as such, there are several things to be aware of:
- shared_buffers takes out a fixed amount of memory
- Each connection takes out a small slice of memory
- Every query requires memory to create and deliver result sets
- Builtin utility and background processes need memory
- There’s no swap in kubernetes…
It’s almost never a good idea to overdo the static parts (shared_buffers) and forget the memory other processes need. First of all, PostgreSQL relies heavily on memory being available. Second, if you hit the hard limit for a pod, Kubernetes will kill it. No questions asked.
So how do we avoid that?
Spilo will default shared_buffers to 20% of the memory limit unless we tell it otherwise. If no limit is explicitly given, the default config in postgres-operator sets 1G as the memory limit. This leaves memory available for the bullet points above. We recommend you do not override the shared_buffers setting.
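To get a feel for the numbers, here is a back-of-the-envelope budget for a pod with a 1GiB limit, using spilo’s 20% shared_buffers default. The per-backend cost is a rough assumption for illustration, not a PostgreSQL constant:

```shell
LIMIT_MB=1024                                  # pod memory limit
SHARED_BUFFERS_MB=$(( LIMIT_MB * 20 / 100 ))   # spilo default: 20% of the limit
MAX_CONNECTIONS=100
PER_BACKEND_MB=5                               # rough assumed cost per backend
BACKENDS_MB=$(( MAX_CONNECTIONS * PER_BACKEND_MB ))
HEADROOM_MB=$(( LIMIT_MB - SHARED_BUFFERS_MB - BACKENDS_MB ))
echo "shared_buffers=${SHARED_BUFFERS_MB}MB backends=${BACKENDS_MB}MB headroom=${HEADROOM_MB}MB"
# → shared_buffers=204MB backends=500MB headroom=320MB
```

With roughly 320MB of headroom left for query work memory, maintenance and background workers, it is easy to see how a few memory-hungry queries can push a small pod over its limit.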
work_mem is set to 4MB by default, and we also recommend leaving it alone - unless you see heavy usage of temp files (which, of course, you are monitoring). Beware that this can happen with a single query running often - can you improve the query? In many cases, you can.
TL;DR: for development or smaller clusters, the defaults should get you a long way. For larger clusters we need more experience, but keeping shared_buffers moderate is likely a good approach. Downsizing shared_buffers in Kubernetes is likely to increase the IO load, but will increase the memory available for queries and background jobs.
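If you do need to adjust resources or individual settings, the cluster manifest accepts resource requests/limits and PostgreSQL parameters directly. The values below are illustrative only, not recommendations - double-check the exact keys against the postgres-operator documentation for your version:

```yaml
spec:
  # ...rest of the cluster manifest as before
  resources:
    requests:
      cpu: 100m
      memory: 1Gi
    limits:
      cpu: 500m
      memory: 2Gi
  postgresql:
    version: "11"
    parameters:        # passed through to postgresql.conf
      work_mem: "8MB"  # example override, not a recommendation
```

Since this lives in the same manifest as the rest of the cluster definition, the sizing decisions end up version-controlled alongside the application.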
Notes on backup with WAL-E
We’ve relied on WAL-E for this article since it makes backups easy to set up, and it plays well with the postgres-operator. The timestamp-related bug mentioned above does, however, make it a bit tricky at the moment to fully exploit WAL-E in production.
The problem with using other backup solutions like barman or pgbackrest is that we leave the domain of the operator and the spilo image, so quite a bit of work has to be done to create a well-functioning backup and restore regime.
One of the neat advantages with a setup like this with WAL-E is the simplicity of testing recovery - it can be automated with much less work than for a physical or virtualized cluster.
Unfortunately, development of WAL-E has more or less stopped. With the exception of the timestamp bug, though, it works for this purpose.
Conclusion

Apart from the WAL-E limitations encountered, there is more than enough quality in the solutions described here to use this setup in production. We have customers in early production, and others now moving into the testing phase. We’ve seen the dreaded OOM killer a few times, but have been able to give the pods enough resources without going too far. Even so - in Kubernetes, your application should be able to survive minor downtime if pods die. That includes the database connection. Designing for it is a good idea.
The postgres-operator provides a very nice way to let development teams take control of their databases, and to manage them in the same way as the rest of the application or service - including disk and memory allocations. Kubernetes secrets keep credentials from spilling out into version control.
For larger monolithic databases and applications that are not on kubernetes, you will probably want to stay on servers a bit longer.
If you are already on Kubernetes, and have smaller databases (not unlikely for microservices, for instance), the postgres-operator should provide operations with fewer headaches, and developers with more control (and of course, more responsibility).