Problem Symptom — Slow Startup#
Version: PG 13.2
Database startup was slow. The startup process was reading spill files, and the filenames kept changing. Checking the spill files was also very slow — ls -l eventually showed 8 million spill files.
Why Tens of Millions of Spill Files?#
WAL Segment and LSN Meaning#
LSN#
LSN is a 64-bit bigint. An LSN actually looks like 42D3B/1732C540 (hex). Before the slash / is the 32-bit logical log number, and after the / are 32 bits split into segment number + block number + intra-block offset. These 4 parts are:
| 32 bits | 8 bits | 11 bits | 13 bits |
|---|---|---|---|
| Logical log number | Log segment number | Block number | Intra-block offset |
Intra-block offset 8192 = 2^13
Block number = 16M (default WAL segment size) / 8192
WAL Segment#
A WAL filename consists of 3 groups of hex digits.
Taking the 8k WAL file 0000000300042D3B00000002 as example:
| 32 bits | 32 bits | 32 bits |
|---|---|---|
| timeline | Logical log number | Log segment number |
| 00000003 | 00042D3B | 00000002 |
It can be seen that an LSN can locate a WAL filename and the offset position within the file.
Among these, the part before the LSN slash / is the logical log number, and the 8-bit log segment number after the slash / will be used below.
Spill Filename Conversion#
Replication slot name: logical_ex2209_rep
Spill filename: xid-407989064-lsn-42D1E-20000000.spill
42D1E is not a complete LSN and cannot be directly used with pg_walfile_name to locate a WAL filename. 42D1E is a logical log number. If we directly filter WAL files containing 42D1E in the name, we find 16 WAL files.
Can we locate the WAL log segment number from the number 20000000 to pinpoint the exact file?
Spill filename generation:
/*
* Given a replication slot, transaction ID and segment number, fill in the
* corresponding spill file into 'path', which is a caller-owned buffer of size
* at least MAXPGPATH.
*/
static void
ReorderBufferSerializedPath(char *path, ReplicationSlot *slot, TransactionId xid,
XLogSegNo segno)
{
XLogRecPtr recptr;
XLogSegNoOffsetToRecPtr(segno, 0, wal_segment_size, recptr);
snprintf(path, MAXPGPATH, "pg_replslot/%s/xid-%u-lsn-%X-%X.spill",
NameStr(MyReplicationSlot->data.name),
xid,
(uint32) (recptr >> 32), (uint32) recptr);
}The pg_replslot/%s and xid-%u-lsn parts are easy to understand — just the replication slot name and xid. The recptr needs a closer look at its definition:
/*
* Pointer to a location in the XLOG. These pointers are 64 bits wide,
* because we don't want them ever to overflow.
*/
typedef uint64 XLogRecPtr;XLogSegNoOffsetToRecPtr calculates the LSN from the WAL log segment number, segment size, and offset:
#define XLogSegNoOffsetToRecPtr(segno, offset, wal_segsz_bytes, dest) \
(dest) = (segno) * (wal_segsz_bytes) + (offset)XLogRecPtr is the LSN! So:
(uint32) (recptr >> 32) takes the first 32 bits of LSN, (uint32) recptr) takes the last 32 bits.
The first 32 bits of LSN is what we saw as the first half of LSN, lsn-42D1E. The last 32 bits of LSN actually contain more information; here we only need the first few bits of the last 32 bits — the segment number.
Since the passed-in offset=0 and we also have segno, we don’t actually need the intra-block offset information to calculate the dest value. The real value of wal_segsz_bytes is 128M = 128*1024*1024. Converting the formula in XLogSegNoOffsetToRecPtr:
segno= dest/(128*1024*1024)
-- Convert hex 20000000
segno= x'20000000'::int/(128*1024*1024)
segno= 4From this formula we can derive the log segment number segno, which lets us locate the WAL file number.
So the spill filename xid-407989064-lsn-42D1E-20000000.spill corresponds to the WAL file:
Logical log number=42D1E, segment number=04:
ls 42D1E*04
0000000200042D1E00000004pg_waldump shows xid 407989064 inside.
In practice, the WAL size is also fixed after instance creation, i.e. (128*1024*1024) is a constant, so segno is absolutely correlated with (uint32) recptr, but not equal to it. This means that switching to a new WAL log file creates a new spill file.
Summary of spill file generation rules:
- Same transaction id: if it spans multiple WAL files, it produces multiple spills. E.g., a large transaction without subtransactions spanning 3 WAL files produces 3 spill files.
- Different transaction ids produce different spills. E.g., 10 million subtransactions produce 10 million spill files.
Spill filename structure xid-407989064-lsn-42D1E-20000000.spill:
| xid | First 32 bits of LSN; i.e., WAL logical log number | Converted from WAL log segment number; not equal to segment number |
|---|---|---|
| xid-407989064 | lsn-42D1E | 20000000 |
## Recovered environment
[postgres]$ ll |head -100
total 40000276
-rw------- 1 postgres postgres 184 Dec 6 15:20 state
-rw------- 1 postgres postgres 196 Dec 6 13:25 xid-407989064-lsn-42D1E-0.spill
-rw------- 1 postgres postgres 208 Dec 6 13:25 xid-407989064-lsn-42D1E-20000000.spill
...
-rw------- 1 postgres postgres 540 Dec 6 16:44 xid-407989064-lsn-42D2A-D0000000.spill
-rw------- 1 postgres postgres 152 Dec 6 13:09 xid-407989065-lsn-42D1D-C8000000.spill
-rw------- 1 postgres postgres 152 Dec 6 13:09 xid-407989066-lsn-42D1D-C8000000.spill
-rw------- 1 postgres postgres 152 Dec 6 13:09 xid-407989068-lsn-42D1D-C8000000.spill
-rw------- 1 postgres postgres 152 Dec 6 13:09 xid-407989070-lsn-42D1D-C8000000.spill
-rw------- 1 postgres postgres 152 Dec 6 13:09 xid-407989072-lsn-42D1D-C8000000.spill
-rw------- 1 postgres postgres 152 Dec 6 13:09 xid-407989076-lsn-42D1D-C8000000.spill
-rw------- 1 postgres postgres 152 Dec 6 13:09 xid-407989079-lsn-42D1D-C8000000.spill
-rw------- 1 postgres postgres 152 Dec 6 13:09 xid-407989080-lsn-42D1D-C8000000.spill
-rw------- 1 postgres postgres 152 Dec 6 13:09 xid-407989082-lsn-42D1D-C8000000.spill
[postgres@lzlhost /myhost/liuzhilong/pg_replslot/logical_ex9e15_rep]$ ll |awk '{print $9}'|awk -F '-' '{print $2}'|sort|uniq -c|wc -l
10000003
[postgres@lzlhost /myhost/liuzhilong/pg_replslot/logical_ex9e15_rep]$ ll |wc -l
10000070So in the actual environment we saw 10,000,070 files, with 10,000,003 distinct xids among them — meaning 1 parent transaction spanning about 70 WAL files, with this parent transaction having 10 million subtransactions.
Replication Slot Spill Testing#
--Pub/sub replication link setup
logical_decoding_work_mem = 64MB #pg_ctl reload
wal_segment_size =128 MB
--source
CREATE TABLE replication_table (
id BIGSERIAL PRIMARY KEY,
column1 char(2000),
column2 char(2000),
column3 char(2000)
);
create publication pub_test for table replication_table ;
--dest
CREATE TABLE replication_table (
id BIGSERIAL PRIMARY KEY,
column1 char(2000),
column2 char(2000),
column3 char(2000)
);
CREATE SUBSCRIPTION sub_test
CONNECTION 'host=127.0.0.1 port=8094 dbname=lzl user=lzl password=qwer'
PUBLICATION pub_test;
--source
select * from pg_replication_slots;Large Transaction, No Subtransactions, Replicated Table Spill Test#
--Create a large transaction, don't commit yet
begin;
insert into replication_table(column1,column2,column3)
select 'a','b','c' from generate_series(1,1000000) g;
--Replication slot spill
ll
total 331924
-rw------- 1 postgres postgres 184 Dec 9 20:22 state
-rw------- 1 postgres postgres 88226964 Dec 9 20:22 xid-5074343-lsn-163-38000000.spill
-rw------- 1 postgres postgres 119698488 Dec 9 20:22 xid-5074343-lsn-163-40000000.spillAfter the large transaction commits, wait for consumption until replication lag is 0, and the spill files disappear:
M=# select pid,usename,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,reply_time from pg_stat_replication;
pid | usename | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | reply_time
--------+---------+--------------+--------------+--------------+--------------+-----------+-----------+------------+------------------------------
163525 | lzl | 163/4996E1C8 | 163/4996E1C8 | 163/4996E1C8 | 163/4996E1C8 | [null] | [null] | [null] | 2024-12-09 20:25:35.14769+08
(1 row)
M=# select pid,usename,pg_wal_lsn_diff(pg_current_wal_lsn(),sent_lsn) diff_sent_mb,pg_wal_lsn_diff(pg_current_wal_lsn(),write_lsn) diff_write_mb,pg_wal_lsn_diff(pg_current_wal_lsn(),flush_lsn) diff_flush_mb,pg_wal_lsn_diff(pg_current_wal_lsn(),replay_lsn) diff_replay_mb,pg_walfile_name_offset(sent_lsn) sentoffset,pg_walfile_name_offset(write_lsn) writeoffset,pg_walfile_name_offset(flush_lsn) flush_lsn from pg_stat_replication;
pid | usename | diff_sent_mb | diff_write_mb | diff_flush_mb | diff_replay_mb | sentoffset | writeoffset | flush_lsn
--------+---------+--------------+---------------+---------------+----------------+-------------------------------------+-------------------------------------+-------------------------------
163525 | lzl | 0 | 0 | 0 | 0 | (000000010000016300000009,26665416) | (000000010000016300000009,26665416) | (000000
[/mypg/pg8094/data/pg_replslot/sub_test]$ ll
total 357392
-rw------- 1 postgres postgres 184 Dec 9 20:23 state
-rw------- 1 postgres postgres 88226964 Dec 9 20:22 xid-5074343-lsn-163-38000000.spill
-rw------- 1 postgres postgres 137696328 Dec 9 20:23 xid-5074343-lsn-163-40000000.spill
-rw------- 1 postgres postgres 26076708 Dec 9 20:23 xid-5074343-lsn-163-48000000.spill
[/mypg/pg8094/data/pg_replslot/sub_test]$ ll
total 4
-rw------- 1 postgres postgres 184 Dec 9 20:25 state2666
(1 row)Large Transaction, No Subtransactions, Non-Replicated Table Spill Test#
--source: create an unrelated table for writing data
CREATE TABLE no_replication_table (
id BIGSERIAL PRIMARY KEY,
column1 char(2000),
column2 char(2000),
column3 char(2000)
);
--Create a large transaction, don't commit yet
begin;
insert into no_replication_table(column1,column2,column3)
select 'a','b','c' from generate_series(1,1000000) g;
--Spill
[postgres@lzldb:MYINST:8094 /mypg/pg8094/data/pg_replslot/sub_test]$ ll
total 357492
-rw------- 1 postgres postgres 184 Dec 9 20:09 state
-rw------- 1 postgres postgres 107511456 Dec 9 20:08 xid-5074106-lsn-163-28000000.spill
-rw------- 1 postgres postgres 137698804 Dec 9 20:09 xid-5074106-lsn-163-30000000.spill
-rw------- 1 postgres postgres 4308444 Dec 9 20:09 xid-5074106-lsn-163-38000000.spillLarge Transaction, Subtransactions, Non-Replicated Table Spill Test#
## One insert per row, each insert as one subtransaction
echo "begin;">subtx.sql
for i in {1..1000000}
do
echo "savepoint p$i;">>subtx.sql
echo "insert into no_replication_table(column1,column2,column3) select 'a','b','c';">>subtx.sql
done
nohup psql -d lzl -f subtx.sql &#During execution, observed 800k+ spill files
[/myhost/pg8094/data/pg_replslot/sub_test]$ ll |wc -l
823749
[/myhost/pg8094/data/pg_replslot/sub_test]$ ll |head -10
total 1099532
-rw------- 1 postgres postgres 184 Dec 9 21:10 state
-rw------- 1 postgres postgres 1236 Dec 9 21:10 xid-5519686-lsn-163-70000000.spill
-rw------- 1 postgres postgres 252 Dec 9 21:09 xid-5519687-lsn-163-70000000.spill
-rw------- 1 postgres postgres 252 Dec 9 21:09 xid-5519688-lsn-163-70000000.spill
-rw------- 1 postgres postgres 252 Dec 9 21:09 xid-5519689-lsn-163-70000000.spill
-rw------- 1 postgres postgres 252 Dec 9 21:09 xid-5519690-lsn-163-70000000.spill
-rw------- 1 postgres postgres 252 Dec 9 21:09 xid-5519691-lsn-163-70000000.spill
-rw------- 1 postgres postgres 252 Dec 9 21:09 xid-5519692-lsn-163-70000000.spill
-rw------- 1 postgres postgres 252 Dec 9 21:09 xid-5519693-lsn-163-70000000.spillAnalysis of Slow Database Startup#
Startup Process Startup Flow Analysis#
Here we parse the startup flow frame by frame using the call stack:
11: main: Nothing to say.
10: PostmasterMain:
Before the main loop, it first calls the startup flow StartupPID = StartupDataBase(); which essentially calls StartChildProcess(StartupProcess):
#define StartupDataBase() StartChildProcess(StartupProcess)9: StartChildProcess: Forks a process. This process is the auxiliary process for starting postmaster; normal child process startup goes through this logic, forking at this step. The input AuxProcType=StartupProcess.
8: AuxiliaryProcessMain:
Since MyAuxProcType=StartupProcess, it goes through the StartupProcessMain flow, which is different from child processes like walsender, walwriter, bgwriter. The startup process itself exists for crash recovery WAL reading, but it does many other things:
switch (MyAuxProcType)
{
case CheckerProcess:
/* don't set signals, they're useless here */
CheckerModeMain();
proc_exit(1); /* should never return */
case BootstrapProcess:
/*
* There was a brief instant during which mode was Normal; this is
* okay. We need to be in bootstrap mode during BootStrapXLOG for
* the sake of multixact initialization.
*/
SetProcessingMode(BootstrapProcessing);
bootstrap_signals();
BootStrapXLOG();
BootstrapModeMain();
proc_exit(1); /* should never return */
case StartupProcess: //Here here here here
/* don't set signals, startup process has its own agenda */
StartupProcessMain();
proc_exit(1); /* should never return */
case BgWriterProcess:
/* don't set signals, bgwriter has its own agenda */
BackgroundWriterMain();
proc_exit(1); /* should never return */
case CheckpointerProcess:
/* don't set signals, checkpointer has its own agenda */
CheckpointerMain();
proc_exit(1); /* should never return */
case WalWriterProcess:
/* don't set signals, walwriter has its own agenda */
InitXLOGAccess();
WalWriterMain();
proc_exit(1); /* should never return */
case WalReceiverProcess:
/* don't set signals, walreceiver has its own agenda */
WalReceiverMain();
proc_exit(1); /* should never return */
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
proc_exit(1);
}7: StartupProcessMain: Mainly to call StartupXLOG().
6: StartupXLOG:
Function comment:
This must be called ONCE during postmaster or standalone-backend startupStartupXLOG is always called by postmaster regardless, whether crash shutdown or consistent shutdown:
switch (ControlFile->state)
{
...
case DB_IN_PRODUCTION:
ereport(LOG,
(errmsg("database system was interrupted; last known up at %s",
str_time(ControlFile->time))));
break;This matches the log output. Here’s the shutdown and startup log:
2024-12-06 17:02:57.534 CST,,,447560,,65693cde.6d448,1325,,2023-12-01 09:54:38 CST,,0,LOG,00000,"database system is shut down",,,,,,,,,"","postmaster"
2024-12-06 17:03:49.536 CST,,,211844,,6752bdf3.33b84,1,,2024-12-06 17:03:47 CST,,0,LOG,00000,"ending log output to stderr",,"Future log output will go to log destination ""csvlog"".",,,,,,,"","postmaster"
2024-12-06 17:03:49.536 CST,,,211844,,6752bdf3.33b84,2,,2024-12-06 17:03:47 CST,,0,LOG,00000,"starting PostgreSQL 13.2 (RaseSQL 1.3) on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39.0.1), 64-bit",,,,,,,,,"","postmaster"
2024-12-06 17:03:49.537 CST,,,211844,,6752bdf3.33b84,3,,2024-12-06 17:03:47 CST,,0,LOG,00000,"listening on IPv4 address ""0.0.0.0"", port 7284",,,,,,,,,"","postmaster"
2024-12-06 17:03:49.539 CST,,,211844,,6752bdf3.33b84,4,,2024-12-06 17:03:47 CST,,0,LOG,00000,"listening on Unix socket ""/tmp/.s.PGSQL.7284""",,,,,,,,,"","postmaster"
2024-12-06 17:03:49.557 CST,,,211995,,6752bdf5.33c1b,1,,2024-12-06 17:03:49 CST,,0,LOG,00000,"database system was interrupted; last known up at 2024-12-06 17:00:10 CST",,,,,,,,,"","startup"So, after shutdown, the control file recorded the database state as in production:
Database cluster state: in productionThe in production state means the database is running, not a normal shutdown state — indicating that at the time of shutdown, it was not a consistent shutdown.
Continuing with the key code about fsync:
/*----------
* If we previously crashed, perform a couple of actions:
*
* - The pg_wal directory may still include some temporary WAL segments
* used when creating a new segment, so perform some clean up to not
* bloat this path. This is done first as there is no point to sync
* this temporary data.
*
* - There might be data which we had written, intending to fsync it, but
* which we had not actually fsync'd yet. Therefore, a power failure in
* the near future might cause earlier unflushed writes to be lost, even
* though more recent data written to disk from here on would be
* persisted. To avoid that, fsync the entire data directory.
*/
if (ControlFile->state != DB_SHUTDOWNED &&
ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
{
RemoveTempXlogFiles();
SyncDataDirectory();
}Here, because the control file state is not a normal shutdown, it enters the if-block and calls SyncDataDirectory() for fsync persistence.
StartupXLOG does many many things. Among those related to spill, besides SyncDataDirectory(), there’s also StartupReorderBuffer():
/*
* Initialize replication slots, before there's a chance to remove
* required resources.
*/
StartupReplicationSlots();
/*
* Startup logical state, needs to be setup now so we have proper data
* during crash recovery.
*/
StartupReorderBuffer();StartupReorderBuffer is also called. It calls ReorderBufferCleanupSerializedTXNs to clean up spill files in all slot directories (but does not delete directories or state files):
/*
* Delete all data spilled to disk after we've restarted/crashed. It will be
* recreated when the respective slots are reused.
*/
void
StartupReorderBuffer(void)
{
DIR *logical_dir;
struct dirent *logical_de;
logical_dir = AllocateDir("pg_replslot");
while ((logical_de = ReadDir(logical_dir, "pg_replslot")) != NULL)
{
if (strcmp(logical_de->d_name, ".") == 0 ||
strcmp(logical_de->d_name, "..") == 0)
continue;
/* if it cannot be a slot, skip the directory */
if (!ReplicationSlotValidateName(logical_de->d_name, DEBUG2))
continue;
/*
* ok, has to be a surviving logical slot, iterate and delete
* everything starting with xid-*
*/
ReorderBufferCleanupSerializedTXNs(logical_de->d_name);
}
FreeDir(logical_dir);
}5: SyncDataDirectory:
The function comment is very important:
/*
* Issue fsync recursively on PGDATA and all its contents.
*
* We fsync regular files and directories wherever they are, but we
* follow symlinks only for pg_wal and immediately under pg_tblspc.
* Other symlinks are presumed to point at files we're not responsible
* for fsyncing, and might not have privileges to write at all.
*
* Errors are logged but not considered fatal; that's because this is used
* only during database startup, to deal with the possibility that there are
* issued-but-unsynced writes pending against the data directory. We want to
* ensure that such writes reach disk before anything that's done in the new
* run. However, aborting on error would result in failure to start for
* harmless cases such as read-only files in the data directory, and that's
* not good either.
*
* Note that if we previously crashed due to a PANIC on fsync(), we'll be
* rewriting all changes again during recovery.
*
* Note we assume we're chdir'd into PGDATA to begin with.
*/- fsync all data directory files to persist them
- This action only happens during the startup phase
- This action ensures the data directory is fully persistent before the database starts running
The body of SyncDataDirectory recursively walks directories and fsyncs (with some special handling for symlinks):
walkdir(".", datadir_fsync_fname, false, LOG);
if (xlog_is_symlink)
walkdir("pg_wal", datadir_fsync_fname, false, LOG);
walkdir("pg_tblspc", datadir_fsync_fname, true, LOG);4: walkdir: Recurse to .
3: walkdir: Recurse to ./pg_replslot
2: walkdir: Recurse to ./pg_replslot/slotname
1: lstat: C library call. walkdir not only does fsync (via the callback datadir_fsync_fname), the walkdir function body also does lstat to get file info such as inode, file size, last modification time, etc. — similar to the Linux stat command.
0: _lxstat: C library call.
Startup logic summary:
- PG starts an auxiliary process
startupto help with startup. Unlike common child processes (walwriter, bgwriter, checkpointer, etc.), it’s always started during the startup process and does many things. StartupXLOGis always called during startup, whether or not the database was consistently shut down.- Only in a non-normal shutdown state does
SyncDataDirectoryget triggered. SyncDataDirectoryfsyncs all data files for persistence and checks stat info for all data files.- fsync ensures data file consistency before startup; stat is probably to verify files are normal and readable (before the startup process starts, only the readability of the datadir directory was verified).
- Regardless of shutdown state,
StartupReorderBufferis always called and cleans up spill files for all replication slots.
When Is the Ready State?#
After the startup process finishes its work, the database is not yet in ready state. When the pmState state machine changes state, the reaper process reaping function is called. The reaper function itself does some recovery or startup work after a child process exits. The pmState state machine records the state as PM_STARTUP, which controls the startup/shutdown state.
Last steps of PostmasterMain:
StartupPID = StartupDataBase();
Assert(StartupPID != 0);
StartupStatus = STARTUP_RUNNING;
pmState = PM_STARTUP; //State machine changes state
/* Some workers may be scheduled to start now */
maybe_start_bgworkers();
status = ServerLoop();
/*
* ServerLoop probably shouldn't ever return, but if it does, close down.
*/
ExitPostmaster(status != STATUS_OK);
abort(); /* not reached */
}The core startup flow of PostmasterMain goes to reaper to handle the normal exit of the startup process.
PMState comment:
/*
* We use a simple state machine to control startup, shutdown, and
* crash recovery (which is rather like shutdown followed by startup).
*
* After doing all the postmaster initialization work, we enter PM_STARTUP
* state and the startup process is launched. The startup process begins by
* reading the control file and other preliminary initialization steps.
* In a normal startup, or after crash recovery, the startup process exits
* with exit code 0 and we switch to PM_RUN state.
PMState is passed and processed via signals. After the startup process exits, reaper is activated to reap the process.
reaper function handling the startup child process’s normal exit:
if (pid == StartupPID)
{
StartupPID = 0;
...
/*
* Startup succeeded, commence normal operations
*/
StartupStatus = STARTUP_NOT_RUNNING; //Transition from STARTUP_RUNNING to STARTUP_NOT_RUNNING
FatalError = false; //After none of the above ifs are hit, it's not fatal
AbortStartTime = 0;
ReachedNormalRunning = true;
pmState = PM_RUN; //State machine transitions from PM_STARTUP to PM_RUN
connsAllowed = ALLOW_ALL_CONNS;
/*
* Crank up the background tasks, if we didn't do that already
* when we entered consistent recovery state. It doesn't matter
* if this fails, we'll just try again later.
*/
//Below: starting core child processes
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
WalWriterPID = StartWalWriter();
/*
* Likewise, start other special children as needed. In a restart
* situation, some of them may be alive already.
*/
//Below: starting non-core child processes
if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
AutoVacPID = StartAutoVacLauncher();
if (PgArchStartupAllowed() && PgArchPID == 0)
PgArchPID = pgarch_start();
if (PgStatPID == 0)
PgStatPID = pgstat_start();
/* workers may be scheduled to start now */
maybe_start_bgworkers();
//At this point it's officially ready to accept connections
/* at this point we are really open for business */
ereport(LOG,
(errmsg("database system is ready to accept connections")));
/* Report status */
AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_READY);
#ifdef USE_SYSTEMD
sd_notify(0, "READY=1");
#endif
continue;
}The “database system is ready to accept connections” message is right here.
Checkpointer, bgwriter, walwriter, autovacuum, arch (if present), stats — all these processes need to be started. At this stage, launching these processes doesn’t have to return success; they can be retried later in ServerLoop or on the next reaper execution. Only the startup process must start and complete all related tasks in one shot:
if (pid < 0)
{
/* in parent, fork failed */
int save_errno = errno;
errno = save_errno;
switch (type)
{
case StartupProcess:
ereport(LOG,
(errmsg("could not fork startup process: %m")));
break;
case BgWriterProcess:
ereport(LOG,
(errmsg("could not fork background writer process: %m")));
break;
case CheckpointerProcess:
ereport(LOG,
(errmsg("could not fork checkpointer process: %m")));
break;
case WalWriterProcess:
ereport(LOG,
(errmsg("could not fork WAL writer process: %m")));
break;
case WalReceiverProcess:
ereport(LOG,
(errmsg("could not fork WAL receiver process: %m")));
break;
default:
ereport(LOG,
(errmsg("could not fork process: %m")));
break;
}
/*
* fork failure is fatal during startup, but there's no need to choke
* immediately if starting other child types fails.
*/
if (type == StartupProcess)
ExitPostmaster(1);
return 0;
}Spill File Generation Logic Across Versions#
Spill in all versions spills the largest transaction. Here we focus on when spilling happens.
PG12: pg12 hard-codes 4096 changes:
static const Size max_changes_in_memory = 4096;/*
* Check whether the transaction tx should spill its data to disk.
*/
static void
ReorderBufferCheckSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
/*
* TODO: improve accounting so we cheaply can take subtransactions into
* account here.
*/
if (txn->nentries_mem >= max_changes_in_memory)
{
ReorderBufferSerializeTXN(rb, txn);
Assert(txn->nentries_mem == 0);
}
}PG13: Spills when exceeding logical_decoding_work_mem memory size:
static void
ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
{
...
while (rb->size >= logical_decoding_work_mem * 1024L)
{
/*
* Pick the largest transaction (or subtransaction) and evict it from
* memory by serializing it to disk.
*/
txn = ReorderBufferLargestTXN(rb);
ReorderBufferSerializeTXN(rb, txn);
...
}PG14: Adds streaming transfer ReorderBufferStreamTXN:
static void
ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
{
...
while (rb->size >= logical_decoding_work_mem * 1024L)
{
/*
* Pick the largest transaction (or subtransaction) and evict it from
* memory by streaming, if possible. Otherwise, spill to disk.
*/
if (ReorderBufferCanStartStreaming(rb) &&
(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
{...
ReorderBufferStreamTXN(rb, txn);
}
else
{...
ReorderBufferSerializeTXN(rb, txn);
}
...
}Although PG14 has streaming replication, triggering it requires certain conditions:
/* Returns true, if the streaming can be started now, false, otherwise. */
static inline bool
ReorderBufferCanStartStreaming(ReorderBuffer *rb)
{
LogicalDecodingContext *ctx = rb->private_data;
SnapBuild *builder = ctx->snapshot_builder;
/* We can't start streaming unless a consistent state is reached. */
if (SnapBuildCurrentState(builder) < SNAPBUILD_CONSISTENT)
return false;
/*
* We can't start streaming immediately even if the streaming is enabled
* because we previously decoded this transaction and now just are
* restarting.
*/
if (ReorderBufferCanStream(rb) &&
!SnapBuildXactNeedsSkip(builder, ctx->reader->EndRecPtr))
return true;
return false;
} /*
* Found a point after SNAPBUILD_FULL_SNAPSHOT where all transactions that
* were running at that point finished. Till we reach that we hold off
* calling any commit callbacks.
*/
SNAPBUILD_CONSISTENT = 2Additional streaming trigger conditions:
- Condition 1: All transactions covered by the snapshot have completed (presumably committed or rolled back)
- Condition 2: The context is private data (does this mean two links to one table won’t trigger streaming?)
- Condition 3: Transactions in the snapshot are not skippable (probably some special transactions can be skipped)
PG15: Similar to 14, just cleaner functions with less nesting.
PG16: About the same.
PG17: About the same, adds DEBUG_LOGICAL_REP_STREAMING_IMMEDIATE to force streaming.
Key points to remember:
- PG12 and earlier: hard-coded 4096 changes
- PG13: adds
logical_decoding_work_memparameter, allowing memory size adjustment to reduce spill probability - PG14 and later: supports streaming replication
- Triggering streaming also requires certain conditions, so even with streaming, spills can still happen
- PG17: adds
debug_logical_replication_streamingparameter to force streaming
Spill File Cleanup Logic#
Startup-time spill cleanup is just one scenario. There’s also walsender startup cleanup and drop slot cleanup.
Walsender Startup Cleanup#
ReorderBufferCleanupSerializedTXNs is called during database startup (before walsender has started) and during walsender startup (while the database is running). Note these are different scenarios, though they call the same function. From the function comment, it’s meant to “remove leftover serialized reorder buffers” — i.e., clean up spill files.
/*
* Remove any leftover serialized reorder buffers from a slot directory after a
* prior crash or decoding session exit.
*/
static void
ReorderBufferCleanupSerializedTXNs(const char *slotname)
{
DIR *spill_dir;
struct dirent *spill_de;
struct stat statbuf;
char path[MAXPGPATH * 2 + 12];
sprintf(path, "pg_replslot/%s", slotname);
/* we're only handling directories here, skip if it's not ours */
if (lstat(path, &statbuf) == 0 && !S_ISDIR(statbuf.st_mode))
return;
spill_dir = AllocateDir(path);
while ((spill_de = ReadDirExtended(spill_dir, path, INFO)) != NULL)
{
/* only look at names that can be ours */
//Only compare first 3 characters
if (strncmp(spill_de->d_name, "xid", 3) == 0)
{
snprintf(path, sizeof(path),
"pg_replslot/%s/%s", slotname,
spill_de->d_name);
if (unlink(path) != 0)
ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not remove file \"%s\" during removal of pg_replslot/%s/xid*: %m",
path, slotname)));
}
}
FreeDir(spill_dir);
}Two things to note about the above cleanup logic:
- Cleans files whose names start with “xid”. Obviously, the state file is not cleaned.
- Uses unlink to clean, one file at a time. This can help us devise a faster startup scheme.
Database Startup Cleanup#
During database startup, a startup process is forked to clean slots. The cleanup function is the same one walsender calls: ReorderBufferCleanupSerializedTXNs.
One more difference: after walsender restarts, it only cleans spills for the current slot with the same name; whereas during database startup, all slot spills are cleaned sequentially.
Database startup process, while-loop sequential cleanup logic:
void
StartupReorderBuffer(void)
{
DIR *logical_dir;
struct dirent *logical_de;
logical_dir = AllocateDir("pg_replslot");
while ((logical_de = ReadDir(logical_dir, "pg_replslot")) != NULL)
{ //Exclude . and ..
if (strcmp(logical_de->d_name, ".") == 0 ||
strcmp(logical_de->d_name, "..") == 0)
continue;
//Validate slot name
/* if it cannot be a slot, skip the directory */
if (!ReplicationSlotValidateName(logical_de->d_name, DEBUG2))
continue;
/*
* ok, has to be a surviving logical slot, iterate and delete
* everything starting with xid-*
*/
ReorderBufferCleanupSerializedTXNs(logical_de->d_name);
}
FreeDir(logical_dir);
}The while loop calls ReorderBufferCleanupSerializedTXNs, and after that, the logic is the same as walsender startup cleanup.
Manual Cleanup via pg_drop_replication_slot#
The drop slot cleanup logic is different from the automatic spill file cleanup — it does not call ReorderBufferCleanupSerializedTXNs.
Drop slot flow:
pg_drop_replication_slot(PG_FUNCTION_ARGS) -> ReplicationSlotDrop(const char *name, bool nowait) -> ReplicationSlotDropAcquired(void) -> ReplicationSlotDropPtr
ReplicationSlotDropPtr’s slot cleanup logic is also interesting:
/*
* Permanently drop the replication slot which will be released by the point
* this function returns.
*/
static void
ReplicationSlotDropPtr(ReplicationSlot *slot)
{
char path[MAXPGPATH];
char tmppath[MAXPGPATH];
/*
* If some other backend ran this code concurrently with us, we might try
* to delete a slot with a certain name while someone else was trying to
* create a slot with the same name.
*/
LWLockAcquire(ReplicationSlotAllocationLock, LW_EXCLUSIVE);
/* Generate pathnames. */
sprintf(path, "pg_replslot/%s", NameStr(slot->data.name));
sprintf(tmppath, "pg_replslot/%s.tmp", NameStr(slot->data.name));
/*
* Rename the slot directory on disk, so that we'll no longer recognize
* this as a valid slot. Note that if this fails, we've got to mark the
* slot inactive before bailing out. If we're dropping an ephemeral or a
* temporary slot, we better never fail hard as the caller won't expect
* the slot to survive and this might get called during error handling.
*/
if (rename(path, tmppath) == 0) //rename file
{
/*
* We need to fsync() the directory we just renamed and its parent to
* make sure that our changes are on disk in a crash-safe fashion. If
* fsync() fails, we can't be sure whether the changes are on disk or
* not. For now, we handle that by panicking;
* StartupReplicationSlots() will try to straighten it out after
* restart.
*/
//fsync persistence
START_CRIT_SECTION();
fsync_fname(tmppath, true);
fsync_fname("pg_replslot", true);
END_CRIT_SECTION();
}
...
/*
* If removing the directory fails, the worst thing that will happen is
* that the user won't be able to create a new slot with the same name
* until the next server restart. We warn about it, but that's all.
*/
if (!rmtree(tmppath, true))
ereport(WARNING,
(errmsg("could not remove directory \"%s\"", tmppath)));
/*
* We release this at the very end, so that nobody starts trying to create
* a slot while we're still cleaning up the detritus of the old one.
*/
LWLockRelease(ReplicationSlotAllocationLock);
}Drop slot doesn’t directly unlink files in the slot directory. Instead, it first renames the slotname/ directory to slotname.tmp/, then unlinks the files inside, and finally removes the slotname.tmp/ directory itself.
In this, rmtree also loops to unlink files.
Accelerated Startup Scheme After Replication Slot Spill#
Deleting 10 million spill files is obviously very slow, but directly moving the directory (mv) is extremely fast. However, direct mv requires attention to the name after the move and the state file, as well as knowing which source code step the mv bypasses.
mv Naming Notes#
Since it was an abnormal shutdown, the startup process will execute SyncDataDirectory to fsync and stat all data files — this is hard to bypass. After SyncDataDirectory completes, it starts handling replication slots. When handling slots, it calls StartupReorderBuffer() -> ReorderBufferCleanupSerializedTXNs to fully clean up spill files.
Before entering cleanup, ReplicationSlotValidateName validates the slot name. We can exploit ReplicationSlotValidateName to trick the startup process into skipping the ReorderBufferCleanupSerializedTXNs process.
ReplicationSlotValidateName rules:
bool
ReplicationSlotValidateName(const char *name, int elevel)
{
...
for (cp = name; *cp; cp++)
{ //Key rule here
if (!((*cp >= 'a' && *cp <= 'z')
|| (*cp >= '0' && *cp <= '9')
|| (*cp == '_')))
{
ereport(elevel,
(errcode(ERRCODE_INVALID_NAME),
errmsg("replication slot name \"%s\" contains invalid character",
name),
errhint("Replication slot names may only contain lower case letters, numbers, and the underscore character.")));
return false;
}
}
return true;
}Valid slot names only contain a-z, 0-9, _.
So when renaming, it’s recommended to add a dot .:
- Recommended:
slotname.bak,slotname.20241215, etc. - Not recommended:
slotnamebackup,slotname20241215,slotname_bak, etc. - Not recommended:
.tmpsuffix — slot names with.tmphave special meaning.
After renaming, you need to create the directory and copy the state file, otherwise the slot will behave strangely on startup (e.g., duplicate slot names, auto-generated slot names, inability to delete slots, downstream unable to start the replication link, etc.).
Recommended mv operations summarized:
cd pg_replslot
mv slotname slotname.bak
mkdir slotname
cp slotname.bak/state slotname/Startup Time Comparison#
Compare startup speed across different source code flows to see if manual mv/rm acceleration is actually meaningful.
Reference source logic principles:
- Normal shutdown: goes through fsync and stat
- Abnormal shutdown: goes through fsync and stat
- Valid mv: rename slot directory to
.bak, skip unlink - Invalid mv: rename slot directory to
_bak, spill files start with xid, goes through unlink
Since actual spill files would be too slow, I manually created fake slot directories and spill files: 50 slots total, 400k spills per slot, 20 million spills total, to test startup time (using cp directory is much faster than cp or dd files).
| # | Test Plan | Startup Time |
|---|---|---|
| 1 | Normal shutdown; no fsync/stat, no unlink | 0.1 seconds |
| 2 | Normal shutdown, invalid mv; no fsync/stat, unlink | 11 min 41 sec |
| 3 | Abnormal shutdown, valid mv; fsync/stat, no unlink | 4 min 35 sec |
| 4 | Abnormal shutdown, invalid mv; fsync/stat, unlink | 32 min 2 sec |
| 5 | Abnormal shutdown, rm (create slot dir, keep state) | 13 min 4 sec |
Comparing plans 3 and 5, theoretically in the scenario at hand, a valid mv could achieve startup in about 4 minutes, while rm would take about 13 minutes. (This is a rough comparison; the recovery environment already showed some differences.)