Replicated PostgreSQL with pgpool2

(2014-08-31)

I run multiple web services, mostly related to i3wm.org. All of them use PostgreSQL as their database, so the data that is stored in that PostgreSQL database is pretty important to me and the users of these services.

Since a while now, I have been thinking about storing that data in a more reliable way. Currently, it is stored on a single server, and is backed up to two different locations (one on-site, one off-site) every day. The server in question has a RAID-1 of course, but still: the setup implies that if that one server dies, the last backup may be about a day old in the worst case, and also it could take me significant time to get the services back up.

The areas in which I’d like to improve my setup are thus:

  1. Durability: In case the entire server dies, I want to have an up-to-date copy of all data.
  2. Fault tolerance: In case the entire server dies, I want to be able to quickly switch to a different server. A secondary machine should be ready to take over, albeit not fully automatically because fully automatic solutions typically are either fragile or require a higher number of servers than I’m willing to afford.

For PostgreSQL, there are various settings and additional programs that you can use which will provide you with some sort of clustering/replication. There is an overview in the PostgreSQL wiki (“Replication, Clustering, and Connection Pooling”). My solution of choice is pgpool2 because it seems robust and mature (yet under active development) to me, it is reasonably well documented and I think I roughly understand what it does under the covers.

The plan

I have two servers, located in different data centers, that I will use for this setup. The number of servers does not really matter, meaning you can easily add a third or fourth server (increasing latency with every server of course). However, the low number of servers places some restrictions on what we can do. As an example, solutions that involve global consistency based on paxos/raft quorums will not work with only two servers. As a consequence, master election is out of the question and a human will need to do the actual failover/recovery.

Each of the two servers will run PostgreSQL, but only one of them will run pgpool2 at a time. The DNS records for e.g. faq.i3wm.org will point to the server on which pgpool2 is running, so that server handles 100% of the traffic. Let’s call the server running pgpool2 the primary, and the other server the secondary. All queries that modify the database will still be sent to the secondary, but the secondary does not handle any user traffic. This could be accomplished by either not running the applications in question, or by having them connect to the pgpool2 on the primary.

When a catastrophe happens, the DNS records will be switched to point to the old-secondary server, and pgpool2 will be started there. Once the old-primary server is available again, it will become the secondary server, so that in case of another catastrophe, the same procedure can be executed again.

With a solution that involves only two servers, an often encountered problem are split-brain situations. This means both servers think they are primary, typically because there is a network partition, meaning the servers cannot talk to each other. In our case, it is important that user traffic is not handled by the secondary server. This could happen after failing over because DNS heavily relies on caching, so switching the record does not mean that suddenly all queries will go to the other server — this will only happen over time. A solution for that is to either kill pgpool2 manually if possible, or have a tool that kills pgpool2 when it cannot verify that the DNS record points to the server.

Configuration

I apologize for the overly long lines in some places, but there does not seem to be a way to use line continuations in the PostgreSQL configuration file.

Installing and configuring PostgreSQL

The following steps need to be done on each database server, whereas pgpool2 will only be installed on precisely one server.

Also note that a prerequisite for the configuration described below is that hostnames are configured properly on every involved server, i.e. hostname -f should return the fully qualified hostname of the server in question, and other servers must be able to connect to that hostname.

apt-get install postgresql postgresql-9.4-pgpool2 rsync ssh
cat >>/etc/postgresql/9.4/main/postgresql.conf <<'EOT'
listen_addresses = '*'

max_wal_senders = 1
wal_level = hot_standby
archive_mode = on
archive_command = 'test ! -f /var/lib/postgresql/9.4/main/archive_log/backup_in_progress || (test -f /var/lib/postgresql/9.4/main/archive_log/%f || cp %p /var/lib/postgresql/9.4/main/archive_log/%f)'
EOT
install -o postgres -g postgres -m 700 -d \
  /var/lib/postgresql/9.4/main/archive_log
systemctl restart postgresql.service

pgpool comes with an extension (implemented in C) that provides a couple of functions which are necessary for recovery. We need to “create” the extension in order to be able to use these functions. After running the following command, you can double-check with \dx that the extension was installed properly.

echo 'CREATE EXTENSION "pgpool_recovery"' | \
  su - postgres -c 'psql template1'

During recovery, pgpool needs to synchronize data between the PostgreSQL servers. This is done partly by running pg_basebackup on the recovery target via SSH and using rsync (which connects using SSH). Therefore, we need to create a passwordless SSH key for the postgres user. For simplicity, I am implying that you’ll copy the same id_rsa and authorized_keys files onto every database node. You’ll also need to connect to every other database server once in order to get the SSH host fingerprints into the known_hosts file.

su - postgres
ssh-keygen -f /var/lib/postgresql/.ssh/id_rsa -N ''
cat .ssh/id_rsa.pub >> .ssh/authorized_keys
exit

We’ll also need to access remote databases with pg_basebackup non-interactively, so we need a password file:

su - postgres
echo '*:*:*:postgres:wQgvBEusf1NWDRKVXS15Fc8' > .pgpass
chmod 0600 .pgpass
exit

When pgpool recovers a node, it first makes sure the data directory is up to date, then it starts PostgreSQL and tries to connect repeatedly. Once the connection succeeded, the node is considered healthy. Therefore, we need to give the postgres user permission to control postgresql.service:

apt-get install sudo
cat >/etc/sudoers.d/pgpool-postgres <<'EOT'
postgres ALL=(ALL:ALL) NOPASSWD:/bin/systemctl start postgresql.service
postgres ALL=(ALL:ALL) NOPASSWD:/bin/systemctl stop postgresql.service
EOT

Now enable password-based authentication for all databases and replication traffic. In case your database nodes/clients don’t share a common hostname suffix, you may need to use multiple entries or replace the hostname suffix by “all”.

cat >>/etc/postgresql/9.4/main/pg_hba.conf <<'EOT'
host    all             all             .zekjur.net             md5     
host    replication     postgres        .zekjur.net             md5     
EOT

After enabling password-based authentication, we need to set a password for the postgres user which we’ll use for making the base backup:

echo "ALTER USER postgres WITH PASSWORD 'wQgvBEusf1NWDRKVXS15Fc8';" | \
  su postgres -c psql

Installing pgpool2

apt-get install pgpool2
cd /etc/pgpool2
gunzip -c /usr/share/doc/pgpool2/examples/\
pgpool.conf.sample-replication.gz > pgpool.conf

To interact with pgpool2, there are a few command-line utilities whose name starts with pcp_. In order for these to work, we must configure a username and password. For simplicity, I’ll re-use the password we set earlier for the postgres user, but you could chose to use an entirely different username/password:

echo "postgres:$(pg_md5 wQgvBEusf1NWDRKVXS15Fc8)" >> pcp.conf

In replication mode, when the client should authenticate towards the PostgreSQL database, we also need to tell pgpool2 that we are using password-based authentication:

sed -i 's/trust$/md5/g' pool_hba.conf
sed -i 's/\(enable_pool_hba =\) off/\1 on/g' pgpool.conf

Furthermore, we need to provide all the usernames and passwords that we are going to use to pgpool2:

touch pool_passwd
chown postgres.postgres pool_passwd
pg_md5 -m -u faq_i3wm_org secretpassword

For the use-case I am describing here, it is advisable to turn off load_balance_mode, otherwise queries will be sent to all healthy backends, which is slow because they are not in the same network. In addition, we’ll assign a higher weight to the backend which runs on the same machine as pgpool2, so read-only queries are sent to the local backend only.

sed -i 's/^load_balance_mode = on/load_balance_mode = off/g' \
    pgpool.conf

Now, we need to configure the backends.

sed -i 's/^\(backend_\)/# \1/g' pgpool.conf

cat >>pgpool.conf <<'EOT'
backend_hostname0 = 'midna.zekjur.net'
backend_port0 = 5432
backend_weight0 = 2
backend_data_directory0 = '/var/lib/postgresql/9.4/main'

backend_hostname1 = 'alp.zekjur.net'
backend_port1 = 5432
backend_weight1 = 1
backend_data_directory1 = '/var/lib/postgresql/9.4/main'
EOT

Overview: How recovery works

Let’s assume that pgpool is running on midna.zekjur.net (so midna is handling all the traffic), and alp.zekjur.net crashed. pgpool will automatically degrade alp and continue operation. When you tell it to recover alp because the machine is available again, it will do three things:

  1. (“1st stage”) SSH into alp and run pg_basebackup to get a copy of midna’s database.
  2. (“2nd stage”) Disconnect all clients so that the database on midna will not be modified anymore. Flush all data to disk on midna, then rsync the data to alp. pg_basebackup from 1st stage will have copied almost all of it, so this is a small amount of data — typically on the order of 16 MB, because that’s how big one WAL file is.
  3. Try to start PostgreSQL on alp again. pgpool will wait for 90 seconds by default, and within that time PostgreSQL must start up in such a state that pgpool can connect to it.

So, during the 1st stage, which copies the entire database, traffic will still be handled normally, only during 2nd stage and until PostgreSQL started up no queries are served.

Configuring recovery

For recovery, we need to provide pgpool2 with a couple of shell scripts that handle the details of how the recovery is performed.

sed -i 's/^\(recovery_\|client_idle_limit_in_recovery\)/# \1/g' \
    pgpool.conf

cat >>pgpool.conf <<'EOT'
recovery_user = 'postgres'
recovery_password = 'wQgvBEusf1NWDRKVXS15Fc8'

# This script is being run by invoking the pgpool_recovery() function on
# the current master(primary) postgresql server. pgpool_recovery() is
# essentially a wrapper around system(), so it runs under your database
# UNIX user (typically "postgres").
# Both scripts are located in /var/lib/postgresql/9.4/main/
recovery_1st_stage_command = '1st_stage.sh'
recovery_2nd_stage_command = '2nd_stage.sh'

# Immediately disconnect all clients when entering the 2nd stage recovery
# instead of waiting for the clients to disconnect.
client_idle_limit_in_recovery = -1
EOT

The 1st_stage.sh script logs into the backend that should be recovered and uses pg_basebackup to copy a full backup from the master(primary) backend. It also sets up the recovery.conf which will be used by PostgreSQL when starting up.

cat >/var/lib/postgresql/9.4/main/1st_stage.sh <<'EOF'
#!/bin/sh
TS=$(date +%Y-%m-%d_%H-%M-%S)
MASTER_HOST=$(hostname -f)
MASTER_DATA=$1
RECOVERY_TARGET=$2
RECOVERY_DATA=$3

# Move the PostgreSQL data directory out of our way.
ssh -T $RECOVERY_TARGET \
    "[ -d $RECOVERY_DATA ] && mv $RECOVERY_DATA $RECOVERY_DATA.$TS"

# We only use archived WAL logs during recoveries, so delete all
# logs from the last recovery to limit the growth.
rm $MASTER_DATA/archive_log/*

# With this file present, our archive_command will actually
# archive WAL files.
touch $MASTER_DATA/archive_log/backup_in_progress

# Perform a backup of the database.
ssh -T $RECOVERY_TARGET \
    "pg_basebackup -h $MASTER_HOST -D $RECOVERY_DATA --xlog"

# Configure the restore_command to use the archive_log WALs we’ll copy
# over in 2nd_stage.sh.
echo "restore_command = 'cp $RECOVERY_DATA/archive_log/%f %p'" | \
    ssh -T $RECOVERY_TARGET "cat > $RECOVERY_DATA/recovery.conf"
EOF
cat >/var/lib/postgresql/9.4/main/2nd_stage.sh <<'EOF'
#! /bin/sh
MASTER_DATA=$1
RECOVERY_TARGET=$2
RECOVERY_DATA=$3
port=5432

# Force to flush current value of sequences to xlog
psql -p $port -t -c 'SELECT datname FROM pg_database WHERE NOT datistemplate AND datallowconn' template1|
while read i
do
  if [ "$i" != "" ];then
    psql -p $port -c "SELECT setval(oid, nextval(oid)) FROM pg_class WHERE relkind = 'S'" $i
  fi
done

# Flush all transactions to disk. Since pgpool stopped all connections,
# there cannot be any data that does not reside on disk until the
# to-be-recovered host is back on line.
psql -p $port -c "SELECT pgpool_switch_xlog('$MASTER_DATA/archive_log')" template1

# Copy over all archive logs at once.
rsync -avx --delete $MASTER_DATA/archive_log/ \
    $RECOVERY_TARGET:$RECOVERY_DATA/archive_log/

# Delete the flag file to disable WAL archiving again.
rm $MASTER_DATA/archive_log/backup_in_progress
EOF
cat >/var/lib/postgresql/9.4/main/pgpool_remote_start <<'EOF'
#!/bin/sh
ssh $1 sudo systemctl start postgresql.service
EOF

chmod +x /var/lib/postgresql/9.4/main/1st_stage.sh
chmod +x /var/lib/postgresql/9.4/main/2nd_stage.sh
chmod +x /var/lib/postgresql/9.4/main/pgpool_remote_start

Now, let’s start pgpool2 and verify that it works and that we can access our first node. The pcp_node_count command should return an integer number like “2”. The psql command should be able to connect and you should see your database tables when using \d.

systemctl restart pgpool2.service
pcp_node_count 10 localhost 9898 postgres wQgvBEusf1NWDRKVXS15Fc8
psql -p 5433 -U faq_i3wm_org faq_i3wm_org

Monitoring

pgpool2 intercepts a couple of SHOW statements, so you can use the SQL command SHOW pool_nodes to see how many nodes are there:

> SHOW pool_nodes;
 node_id |     hostname     | port | status | lb_weight |  role  
---------+------------------+------+--------+-----------+--------
 0       | midna.zekjur.net | 5432 | 2      | 0.666667  | master
 1       | alp.zekjur.net   | 5432 | 2      | 0.333333  | slave
(2 rows)

You could export a cgi-script over HTTP, which just always runs this command, and then configure your monitoring software to watch for certain strings in the output. Note that you’ll also need to configure a ~/.pgpass file for the www-data user. As an example, to monitor whether alp is still a healthy backend, match for “alp.zekjur.net,5432,2” in the output of this script:

#!/bin/sh
cat <<'EOT'
Content-type: text/plain

EOT
exec echo 'SHOW pool_nodes;' | psql -t -A -F, --host localhost \
  -U faq_i3wm_org faq_i3wm_org

Performing/Debugging a recovery

In order to recover node 1 (alp in this case), use:

pcp_recovery_node 300 localhost 9898 postgres wQgvBEusf1NWDRKVXS15Fc8 1

The “300” used to be a timeout, but these days it’s only supported for backwards compatibility and has no effect.

In case the recovery fails, the only thing you’ll get back from pcp_recovery_node is the text “BackendError”, which is not very helpful. The logfile of pgpool2 contains a bit more information, but to debug recovery problems, I typically strace all PostgreSQL processes and see what the scripts are doing/where they are failing.

pgpool2 behavior during recovery

In order to see how pgpool2 performs during recovery/degradation, you can use this little Go program that tries to do three things every 0.25 seconds: check that the database is healthy (SELECT 1;), run a meaningful SELECT, run an UPDATE.

When a database node goes down, a single query may fail until pgpool2 realizes that the node needs to be degraded. If your database load is light, chances are that pgpool2 will realize the database is down without even failing a single query, though.

2014-08-13 23:15:27.638 health: ✓  select: ✓  update: ✓
2014-08-13 23:15:28.700 insert failed: driver: bad connection
2014-08-13 23:15:28.707 health: ✓  select: ✓  update: x

During recovery, there is a time when pgpool2 will just disconnect all clients and not answer any queries any more (2nd stage). In this case, the state lasted for about 20 seconds:

…
2014-08-13 23:16:01.900 health: ✓  select: ✓  update: ✓
2014-08-13 23:16:02.161 health: ✓  select: ✓  update: ✓
# no queries answered here
2014-08-13 23:16:23.625 health: ✓  select: ✓  update: ✓
2014-08-13 23:16:24.308 health: ✓  select: ✓  update: ✓
…

Conclusion

Setting up a PostgreSQL setup that involves pgpool2 is definitely a lot of work. It could be a bit easier if the documentation was more specific on the details of how recovery is supposed to work and would include the configuration that I came up with above. Ideally, something like pgpool2 would be part of PostgreSQL itself.

I am not yet sure how much software I’ll need to touch in order to make it gracefully deal with the PostgreSQL connection dying and coming back up. I know of at least one program I use (buildbot) which does not handle this situation well at all — it needs a complete restart to work again.

Time will tell if the setup is stable and easy to maintain. In case I make negative experiences, I’ll update this article :).