What is pg_rewind?#
pg_rewind is a PostgreSQL-provided tool. When the timelines of two PG instances diverge, pg_rewind can synchronize them. (For example, the primary is running, the standby failover has been running for a while — at this point the primary and standby timelines have diverged.)
pg_rewind compares the sizes of files between the source and target, then copies differing files from source to target, including configuration files. However, it does not compare unchanged files, so pg_rewind runs efficiently on large databases with few changes.
pg_rewind can be used after a standby failover: even if the standby has been running independently for some time, it can be pulled back to the same state as the primary and become a standby again.
During execution, pg_rewind compares the divergence point between primary (source) and standby (target), and transmits the primary’s WAL logs after the divergence point to the standby. Therefore, if the primary’s WAL after the divergence point is also lost, rewind won’t copy nonexistent WAL logs, and the standby will still fail to become a standby. The solution is to use restore.
!!! When using pg_rewind, back up the target instance. pg_rewind directly overwrites the target database’s files. If rewind fails, the target database may be unable to start.
Using pg_rewind#
After a primary-standby switchover, the old primary continues running, causing timeline inconsistency. The old primary cannot start as a standby for the new primary.
When attempting to start the standby, a timeline error appears:
LOG: entering standby mode
FATAL: requested timeline 2 is not a child of this server's history
DETAIL: Latest checkpoint is at 0/6000028 on timeline 1, but in the history of the requested timeline, the server forked off from that timeline at 0/4000098.
LOG: startup process (PID 22321) exited with exit code 1
LOG: aborting startup due to startup process failure
LOG: database system is shut downAt this point, rewind is needed to realign the primary and standby.
- Configure pg_hba on the current primary Set up login permissions for the pg_rewind user to access the source database. hba changes require a database restart.
vi $source/pg_hba.conf
host all pg 172.17.100.150/32 trustpg_rewind requires a high-privilege user. Newer PG versions allow granting privileges; older versions should use a superuser. My environment is PG 9.6, so I use the OS superuser directly.
- wal_log_hints = on parameter configuration
Append
wal_log_hints = onto the target database’s postgres.conf, then start and shut down the target database once (at this point the primary is running and the standby is shut down).
vi $dest/postgres.conf
wal_log_hints = on- Execute pg_rewind
[pg@lzl pg96data_sla]$ /pg/pg96/bin/pg_rewind --target-pgdata /pg/pg96data_pri --source-server='host=172.17.100.150 port=5433 user=pg password=oracle dbname=postgres'
servers diverged at WAL position 0/4000098 on timeline 1
rewinding from last common checkpoint at 0/4000028 on timeline 1
Done!- Configure standby parameters Modify IP, port, directory, etc. in postgres.conf and recovery.conf. pg_rewind also copies configuration files over.
[pg@lzl pg96data_pri]$ mv recovery.done recovery.conf
[pg@lzl pg96data_pri]$ vi recovery.conf
[pg@lzl pg96data_pri]$ vi postgres.conf- Start the standby
[pg@lzl pg96data_pri]$ /pg/pg96/bin/pg_ctl -D /pg/pg96data_sla -l /pg/pg96data_sla/server.log start
server starting
[pg@lzl pg96data_sla]$ psql -p5433 postgres
psql (9.6.17)
postgres=# \x
Expanded display is on.
postgres=# select * from pg_stat_replication ;
-[ RECORD 1 ]----+------------------------------
pid | 24766
usesysid | 16384
usename | lzl
application_name | walreceiver
client_addr | 172.17.100.150
client_hostname |
client_port | 47345
backend_start | 2021-07-30 07:44:05.582546+00
backend_xmin |
state | streaming
sent_location | 0/4033790
write_location | 0/4033790
flush_location | 0/4033790
replay_location | 0/4033790
sync_priority | 0
sync_state | asyncCommon Issues#
pg_rewind Error 1#
could not fetch remote file "global/pg_control": ERROR: must be superuser to read files
Failure, exitingSolution: Use a high-privilege user.
postgres=# \du
List of roles
Role name | Attributes | Member of
-------------+------------------------------------------------------------+-----------
lzl | Replication | {}
pg | Superuser, Create role, Create DB, Replication, Bypass RLS | {}
rewind_user | | {}The pg user is the built-in superuser that comes with the PG server, matching the PG installation user. The OS installation user certainly has permission to modify pg_control.
/pg/pg96/bin/pg_rewind --target-pgdata /pg/pg96data_pri --source-server='host=172.17.100.150 port=5433 user=pg password=oracle dbname=postgres'pg_rewind Error 2#
could not connect to server: FATAL: no pg_hba.conf entry for host "172.17.100.150", user "rewind_user", database "postgres"
Failure, exitingNo pg_hba.conf entry configured for the connection. Solution: Configure pg_hba for the user, e.g.:
host all pg 172.17.100.150/32 trustpg_rewind Error 3#
[pg@lzl pg96data_sla]$ /pg/pg96/bin/pg_rewind --target-pgdata /pg/pg96data_pri --source-server='host=172.17.100.150 port=5433 user=pg password=oracle dbname=postgres'
target server needs to use either data checksums or "wal_log_hints = on"Root causes:
- full_page_writes (enabled by default)
- wal_log_hints must be set to on, or PG must have checksums enabled at initdb time.
Solution: Add wal_log_hints = on to the target database’s postgres.conf, then start and shut down the target database once (the target was already shut down — it must be started and shut down again for the parameter to take effect).
vi postgres.conf # add to target database config
wal_log_hints = onRestart the target database to apply:
[pg@lzl pg96data_sla]$ /pg/pg96/bin/pg_ctl -D /pg/pg96data_pri -l /pg/pg96data_pri/server.log start
server starting
[pg@lzl pg96data_sla]$ /pg/pg96/bin/pg_ctl -D /pg/pg96data_pri -l /pg/pg96data_pri/server.log stop
waiting for server to shut down.... done