I run multiple web services, mostly related to i3wm.org. All of them use PostgreSQL as their database, so the data that is stored in that PostgreSQL database is pretty important to me and the users of these services.
Since a while now, I have been thinking about storing that data in a more reliable way. Currently, it is stored on a single server, and is backed up to two different locations (one on-site, one off-site) every day. The server in question has a RAID-1 of course, but still: the setup implies that if that one server dies, the last backup may be about a day old in the worst case, and also it could take me significant time to get the services back up.
The areas in which I’d like to improve my setup are thus:
- Durability: In case the entire server dies, I want to have an up-to-date copy of all data.
- Fault tolerance: In case the entire server dies, I want to be able to quickly switch to a different server. A secondary machine should be ready to take over, albeit not fully automatically because fully automatic solutions typically are either fragile or require a higher number of servers than I’m willing to afford.
For PostgreSQL, there are various settings and additional programs that you can use which will provide you with some sort of clustering/replication. There is an overview in the PostgreSQL wiki (“Replication, Clustering, and Connection Pooling”). My solution of choice is pgpool2 because it seems robust and mature (yet under active development) to me, it is reasonably well documented and I think I roughly understand what it does under the covers.
I have two servers, located in different data centers, that I will use for this setup. The number of servers does not really matter, meaning you can easily add a third or fourth server (increasing latency with every server of course). However, the low number of servers places some restrictions on what we can do. As an example, solutions that involve global consistency based on paxos/raft quorums will not work with only two servers. As a consequence, master election is out of the question and a human will need to do the actual failover/recovery.
Each of the two servers will run PostgreSQL, but only one of them will run pgpool2 at a time. The DNS records for e.g. faq.i3wm.org will point to the server on which pgpool2 is running, so that server handles 100% of the traffic. Let’s call the server running pgpool2 the primary, and the other server the secondary. All queries that modify the database will still be sent to the secondary, but the secondary does not handle any user traffic. This could be accomplished by either not running the applications in question, or by having them connect to the pgpool2 on the primary.
When a catastrophe happens, the DNS records will be switched to point to the old-secondary server, and pgpool2 will be started there. Once the old-primary server is available again, it will become the secondary server, so that in case of another catastrophe, the same procedure can be executed again.
With a solution that involves only two servers, an often encountered problem are split-brain situations. This means both servers think they are primary, typically because there is a network partition, meaning the servers cannot talk to each other. In our case, it is important that user traffic is not handled by the secondary server. This could happen after failing over because DNS heavily relies on caching, so switching the record does not mean that suddenly all queries will go to the other server — this will only happen over time. A solution for that is to either kill pgpool2 manually if possible, or have a tool that kills pgpool2 when it cannot verify that the DNS record points to the server.
I apologize for the overly long lines in some places, but there does not seem to be a way to use line continuations in the PostgreSQL configuration file.
Installing and configuring PostgreSQL
The following steps need to be done on each database server, whereas pgpool2 will only be installed on precisely one server.
Also note that a prerequisite for the configuration described below is that
hostnames are configured properly on every involved server, i.e.
-f should return the fully qualified hostname of the server in question,
and other servers must be able to connect to that hostname.
apt-get install postgresql postgresql-9.4-pgpool2 rsync ssh cat >>/etc/postgresql/9.4/main/postgresql.conf <<'EOT' listen_addresses = '*' max_wal_senders = 1 wal_level = hot_standby archive_mode = on archive_command = 'test ! -f /var/lib/postgresql/9.4/main/archive_log/backup_in_progress || (test -f /var/lib/postgresql/9.4/main/archive_log/%f || cp %p /var/lib/postgresql/9.4/main/archive_log/%f)' EOT install -o postgres -g postgres -m 700 -d \ /var/lib/postgresql/9.4/main/archive_log systemctl restart postgresql.service
pgpool comes with an extension (implemented in C) that provides a couple of
functions which are necessary for recovery. We need to “create” the extension
in order to be able to use these functions. After running the following
command, you can double-check with
\dx that the extension was
echo 'CREATE EXTENSION "pgpool_recovery"' | \ su - postgres -c 'psql template1'
During recovery, pgpool needs to synchronize data between the PostgreSQL
servers. This is done partly by running
pg_basebackup on the
recovery target via SSH and using
rsync (which connects using
SSH). Therefore, we need to create a passwordless SSH key for the
postgres user. For simplicity, I am implying that you’ll copy the
authorized_keys files onto every
database node. You’ll also need to connect to every other database server once
in order to get the SSH host fingerprints into the known_hosts file.
su - postgres ssh-keygen -f /var/lib/postgresql/.ssh/id_rsa -N '' cat .ssh/id_rsa.pub >> .ssh/authorized_keys exit
We’ll also need to access remote databases with
non-interactively, so we need a password file:
su - postgres echo '*:*:*:postgres:wQgvBEusf1NWDRKVXS15Fc8' > .pgpass chmod 0600 .pgpass exit
When pgpool recovers a node, it first makes sure the data directory is up to
date, then it starts PostgreSQL and tries to connect repeatedly. Once the
connection succeeded, the node is considered healthy. Therefore, we need to
postgres user permission to control
apt-get install sudo cat >/etc/sudoers.d/pgpool-postgres <<'EOT' postgres ALL=(ALL:ALL) NOPASSWD:/bin/systemctl start postgresql.service postgres ALL=(ALL:ALL) NOPASSWD:/bin/systemctl stop postgresql.service EOT
Now enable password-based authentication for all databases and replication traffic. In case your database nodes/clients don’t share a common hostname suffix, you may need to use multiple entries or replace the hostname suffix by “all”.
cat >>/etc/postgresql/9.4/main/pg_hba.conf <<'EOT' host all all .zekjur.net md5 host replication postgres .zekjur.net md5 EOT
After enabling password-based authentication, we need to set a password for the
postgres user which we’ll use for making the base backup:
echo "ALTER USER postgres WITH PASSWORD 'wQgvBEusf1NWDRKVXS15Fc8';" | \ su postgres -c psql
apt-get install pgpool2 cd /etc/pgpool2 gunzip -c /usr/share/doc/pgpool2/examples/\ pgpool.conf.sample-replication.gz > pgpool.conf
To interact with pgpool2, there are a few command-line utilities whose name
pcp_. In order for these to work, we must configure a
username and password. For simplicity, I’ll re-use the password we set earlier
postgres user, but you could chose to use an entirely
echo "postgres:$(pg_md5 wQgvBEusf1NWDRKVXS15Fc8)" >> pcp.conf
In replication mode, when the client should authenticate towards the PostgreSQL database, we also need to tell pgpool2 that we are using password-based authentication:
sed -i 's/trust$/md5/g' pool_hba.conf sed -i 's/\(enable_pool_hba =\) off/\1 on/g' pgpool.conf
Furthermore, we need to provide all the usernames and passwords that we are going to use to pgpool2:
touch pool_passwd chown postgres.postgres pool_passwd pg_md5 -m -u faq_i3wm_org secretpassword
For the use-case I am describing here, it is advisable to turn off
load_balance_mode, otherwise queries will be sent to all healthy
backends, which is slow because they are not in the same network. In addition,
we’ll assign a higher weight to the backend which runs on the same machine as
pgpool2, so read-only queries are sent to the local backend only.
sed -i 's/^load_balance_mode = on/load_balance_mode = off/g' \ pgpool.conf
Now, we need to configure the backends.
sed -i 's/^\(backend_\)/# \1/g' pgpool.conf cat >>pgpool.conf <<'EOT' backend_hostname0 = 'midna.zekjur.net' backend_port0 = 5432 backend_weight0 = 2 backend_data_directory0 = '/var/lib/postgresql/9.4/main' backend_hostname1 = 'alp.zekjur.net' backend_port1 = 5432 backend_weight1 = 1 backend_data_directory1 = '/var/lib/postgresql/9.4/main' EOT
Overview: How recovery works
Let’s assume that pgpool is running on
midna is handling all the traffic), and
alp.zekjur.net crashed. pgpool will automatically degrade
alp and continue operation. When you tell it to recover
alp because the machine is available again, it will do three
(“1st stage”) SSH into
alpand run pg_basebackup to get a copy of
(“2nd stage”) Disconnect all clients so that the database on
midnawill not be modified anymore. Flush all data to disk on
midna, then rsync the data to
alp. pg_basebackup from 1st stage will have copied almost all of it, so this is a small amount of data — typically on the order of 16 MB, because that’s how big one WAL file is.
Try to start PostgreSQL on
alpagain. pgpool will wait for 90 seconds by default, and within that time PostgreSQL must start up in such a state that pgpool can connect to it.
So, during the 1st stage, which copies the entire database, traffic will still be handled normally, only during 2nd stage and until PostgreSQL started up no queries are served.
For recovery, we need to provide pgpool2 with a couple of shell scripts that handle the details of how the recovery is performed.
sed -i 's/^\(recovery_\|client_idle_limit_in_recovery\)/# \1/g' \ pgpool.conf cat >>pgpool.conf <<'EOT' recovery_user = 'postgres' recovery_password = 'wQgvBEusf1NWDRKVXS15Fc8' # This script is being run by invoking the pgpool_recovery() function on # the current master(primary) postgresql server. pgpool_recovery() is # essentially a wrapper around system(), so it runs under your database # UNIX user (typically "postgres"). # Both scripts are located in /var/lib/postgresql/9.4/main/ recovery_1st_stage_command = '1st_stage.sh' recovery_2nd_stage_command = '2nd_stage.sh' # Immediately disconnect all clients when entering the 2nd stage recovery # instead of waiting for the clients to disconnect. client_idle_limit_in_recovery = -1 EOT
The 1st_stage.sh script logs into the backend that should be recovered and uses
pg_basebackup to copy a full backup from the master(primary)
backend. It also sets up the
recovery.conf which will be used by
PostgreSQL when starting up.
cat >/var/lib/postgresql/9.4/main/1st_stage.sh <<'EOF' #!/bin/sh TS=$(date +%Y-%m-%d_%H-%M-%S) MASTER_HOST=$(hostname -f) MASTER_DATA=$1 RECOVERY_TARGET=$2 RECOVERY_DATA=$3 # Move the PostgreSQL data directory out of our way. ssh -T $RECOVERY_TARGET \ "[ -d $RECOVERY_DATA ] && mv $RECOVERY_DATA $RECOVERY_DATA.$TS" # We only use archived WAL logs during recoveries, so delete all # logs from the last recovery to limit the growth. rm $MASTER_DATA/archive_log/* # With this file present, our archive_command will actually # archive WAL files. touch $MASTER_DATA/archive_log/backup_in_progress # Perform a backup of the database. ssh -T $RECOVERY_TARGET \ "pg_basebackup -h $MASTER_HOST -D $RECOVERY_DATA --xlog" # Configure the restore_command to use the archive_log WALs we’ll copy # over in 2nd_stage.sh. echo "restore_command = 'cp $RECOVERY_DATA/archive_log/%f %p'" | \ ssh -T $RECOVERY_TARGET "cat > $RECOVERY_DATA/recovery.conf" EOF
cat >/var/lib/postgresql/9.4/main/2nd_stage.sh <<'EOF' #! /bin/sh MASTER_DATA=$1 RECOVERY_TARGET=$2 RECOVERY_DATA=$3 port=5432 # Force to flush current value of sequences to xlog psql -p $port -t -c 'SELECT datname FROM pg_database WHERE NOT datistemplate AND datallowconn' template1| while read i do if [ "$i" != "" ];then psql -p $port -c "SELECT setval(oid, nextval(oid)) FROM pg_class WHERE relkind = 'S'" $i fi done # Flush all transactions to disk. Since pgpool stopped all connections, # there cannot be any data that does not reside on disk until the # to-be-recovered host is back on line. psql -p $port -c "SELECT pgpool_switch_xlog('$MASTER_DATA/archive_log')" template1 # Copy over all archive logs at once. rsync -avx --delete $MASTER_DATA/archive_log/ \ $RECOVERY_TARGET:$RECOVERY_DATA/archive_log/ # Delete the flag file to disable WAL archiving again. rm $MASTER_DATA/archive_log/backup_in_progress EOF
cat >/var/lib/postgresql/9.4/main/pgpool_remote_start <<'EOF' #!/bin/sh ssh $1 sudo systemctl start postgresql.service EOF chmod +x /var/lib/postgresql/9.4/main/1st_stage.sh chmod +x /var/lib/postgresql/9.4/main/2nd_stage.sh chmod +x /var/lib/postgresql/9.4/main/pgpool_remote_start
Now, let’s start pgpool2 and verify that it works and that we can access our
first node. The
pcp_node_count command should return an integer
number like “2”. The psql command should be able to connect and you should see
your database tables when using
systemctl restart pgpool2.service pcp_node_count 10 localhost 9898 postgres wQgvBEusf1NWDRKVXS15Fc8 psql -p 5433 -U faq_i3wm_org faq_i3wm_org
pgpool2 intercepts a couple of SHOW statements, so you can use the SQL command
SHOW pool_nodes to see how many nodes are there:
> SHOW pool_nodes; node_id | hostname | port | status | lb_weight | role ---------+------------------+------+--------+-----------+-------- 0 | midna.zekjur.net | 5432 | 2 | 0.666667 | master 1 | alp.zekjur.net | 5432 | 2 | 0.333333 | slave (2 rows)
You could export a cgi-script over HTTP, which just always runs this command,
and then configure your monitoring software to watch for certain strings in the
output. Note that you’ll also need to configure a
www-data user. As an example, to monitor whether
alp is still a healthy backend, match for “alp.zekjur.net,5432,2”
in the output of this script:
#!/bin/sh cat <<'EOT' Content-type: text/plain EOT exec echo 'SHOW pool_nodes;' | psql -t -A -F, --host localhost \ -U faq_i3wm_org faq_i3wm_org
Performing/Debugging a recovery
In order to recover node 1 (
alp in this case), use:
pcp_recovery_node 300 localhost 9898 postgres wQgvBEusf1NWDRKVXS15Fc8 1
The “300” used to be a timeout, but these days it’s only supported for backwards compatibility and has no effect.
In case the recovery fails, the only thing you’ll get back from pcp_recovery_node is the text “BackendError”, which is not very helpful. The logfile of pgpool2 contains a bit more information, but to debug recovery problems, I typically strace all PostgreSQL processes and see what the scripts are doing/where they are failing.
pgpool2 behavior during recovery
In order to see how pgpool2 performs during recovery/degradation, you can use
this little Go program that
tries to do three things every 0.25 seconds: check that the database is healthy
SELECT 1;), run a meaningful SELECT, run an UPDATE.
When a database node goes down, a single query may fail until pgpool2 realizes that the node needs to be degraded. If your database load is light, chances are that pgpool2 will realize the database is down without even failing a single query, though.
2014-08-13 23:15:27.638 health: ✓ select: ✓ update: ✓ 2014-08-13 23:15:28.700 insert failed: driver: bad connection 2014-08-13 23:15:28.707 health: ✓ select: ✓ update: x
During recovery, there is a time when pgpool2 will just disconnect all clients and not answer any queries any more (2nd stage). In this case, the state lasted for about 20 seconds:
… 2014-08-13 23:16:01.900 health: ✓ select: ✓ update: ✓ 2014-08-13 23:16:02.161 health: ✓ select: ✓ update: ✓ # no queries answered here 2014-08-13 23:16:23.625 health: ✓ select: ✓ update: ✓ 2014-08-13 23:16:24.308 health: ✓ select: ✓ update: ✓ …
Setting up a PostgreSQL setup that involves pgpool2 is definitely a lot of work. It could be a bit easier if the documentation was more specific on the details of how recovery is supposed to work and would include the configuration that I came up with above. Ideally, something like pgpool2 would be part of PostgreSQL itself.
I am not yet sure how much software I’ll need to touch in order to make it gracefully deal with the PostgreSQL connection dying and coming back up. I know of at least one program I use (buildbot) which does not handle this situation well at all — it needs a complete restart to work again.
Time will tell if the setup is stable and easy to maintain. In case I make negative experiences, I’ll update this article :).