Skip to main content
  1. Posts/

Query Conflicts: From a Static Table Conflict to Its Root Cause

·3342 words·16 mins
liuzhilong62
Author
liuzhilong62
PostgreSQL DBA. Writing about database internals, production cases, and source code analysis.

Problem Symptoms
#

The Symptom
#

A static historical table with no updates whatsoever — yet queries on the same-city standby consistently hit query conflicts:

ERROR:  40001: canceling statement due to conflict with recovery
DETAIL:  User query might have needed to see row versions that must be removed.
LOCATION:  ProcessInterrupts, postgres.c:3197
Time: 30534.973 ms (00:30.535)

Why a Query Conflict on a Static Table Matters
#

My understanding was that a static table should never experience conflicts (this understanding was wrong — I’ll explain later).

The official documentation lists the conflict cases:

  • Access Exclusive locks taken on the primary server, including both explicit LOCK commands and various DDL actions, conflict with table accesses in standby queries.
  • Dropping a tablespace on the primary conflicts with standby queries using that tablespace for temporary work files.
  • Dropping a database on the primary conflicts with sessions connected to that database on the standby.
  • Application of a vacuum cleanup record from WAL conflicts with standby transactions whose snapshots can still “see” any of the rows to be removed.
  • Application of a vacuum cleanup record from WAL conflicts with queries accessing the target page on the standby, whether or not the data to be removed is visible.

LOCK, DDL, drop tablespace, drop database — definitely none of those.

Vacuum — none either, confirmed by pg_stat_all_tables.last_autovacuum and WAL vacuum records.

The official documentation’s explanation stops there. I carefully verified that none of the above applied.

Extrapolating from existing knowledge, perhaps other scenarios could kill the xmin held by a standby query’s snapshot. For example, in-page pruning removes xmin from rows on a page — if the standby query’s snapshot still depends on those xmins, theoretically a conflict could occur. But a page belongs to a specific table, and querying only one table holds only snapshots and xmins on that table. So, theoretically, in-page pruning on table A should not cause a query conflict on table B (this understanding was also wrong — I’ll explain later).

PG’s official documentation on query conflict scenarios is fairly vague and doesn’t explain well why a static table can experience conflicts. Even combining it with my own extrapolations, there shouldn’t be a conflict. But I noticed this pattern seemed to exist on many instances, so it was worth investigating.

Root Cause Analysis
#

Since the startup process kills the query, checking the startup process’s pstack should reveal the conflict function:

$ pstack 212012
#0  0x00002b283f63d783 in __select_nocancel () from /lib64/libc.so.6
#1  0x00000000008fcf5a in pg_usleep (microsec=<optimized out>) at pgsleep.c:56
#2  0x0000000000787905 in WaitExceedsMaxStandbyDelay (wait_event_info=134217762) at standby.c:208
#3  ResolveRecoveryConflictWithVirtualXIDs (waitlist=0x2398a50, reason=reason@entry=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT, wait_event_info=wait_event_info@entry=134217762, report_waiting=report_waiting@entry=true) at standby.c:276
#4  0x0000000000787b33 in ResolveRecoveryConflictWithVirtualXIDs (report_waiting=true, wait_event_info=134217762, reason=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT, waitlist=<optimized out>) at standby.c:333
#5  ResolveRecoveryConflictWithSnapshot (latestRemovedXid=<optimized out>, node=...) at standby.c:329
#6  0x00000000004c8ffe in heap_xlog_clean (record=0x2366978) at heapam.c:7764
#7  heap2_redo (record=0x2366978) at heapam.c:8917
#8  0x0000000000519e55 in StartupXLOG () at xlog.c:7411
#9  0x000000000072f211 in StartupProcessMain () at startup.c:204
#10 0x00000000005286b1 in AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7ffeb7e39d70) at bootstrap.c:450
#11 0x000000000072c369 in StartChildProcess (type=StartupProcess) at postmaster.c:5494
#12 0x000000000072eb54 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x232edb0) at postmaster.c:1407
#13 0x00000000004892cf in main (argc=3, argv=0x232edb0) at main.c:210

XLOG_HEAP2_CLEAN
#

void
heap2_redo(XLogReaderState *record)
{
	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;

	switch (info & XLOG_HEAP_OPMASK)
	{
		case XLOG_HEAP2_CLEAN:
			heap_xlog_clean(record);
			break;

Only when the redo is XLOG_HEAP2_CLEAN does it enter the next function heap_xlog_clean.

PG 18 no longer has XLOG_HEAP2_CLEAN (it was actually removed around PG15 — this article only looks at versions 13 and 18), but the define can still be found in heapam_xlog.h:

//pg13
#define XLOG_HEAP2_CLEAN		0x10
#define XLOG_HEAP2_FREEZE_PAGE	0x20
#define XLOG_HEAP2_CLEANUP_INFO 0x30
//pg18
 * There's no difference between XLOG_HEAP2_PRUNE_ON_ACCESS,
 * XLOG_HEAP2_PRUNE_VACUUM_SCAN and XLOG_HEAP2_PRUNE_VACUUM_CLEANUP records.
 * They have separate opcodes just for debugging and analysis purposes, to
 * indicate why the WAL record was emitted.
 */
#define XLOG_HEAP2_PRUNE_ON_ACCESS		0x10
#define XLOG_HEAP2_PRUNE_VACUUM_SCAN	0x20
#define XLOG_HEAP2_PRUNE_VACUUM_CLEANUP	0x30

I pulled out PG18’s source because PG13 (our production version) has zero explanation for these CLEAN xl_info macros, making them hard to understand. Since PG18 renamed the macros to something more intuitive and added comments, we can use PG18’s source to understand PG13’s — to figure out what this WAL record does.

All three opcodes are fundamentally PRUNE-related WAL records. From the names, PRUNE_ON_ACCESS looks like pruning triggered by access, while the other two are tied to VACUUM operations.

When checking with pg_waldump, rmgr: Heap2 CLEAN remxid records appear every few seconds, with highly varied filenodes and no relation to the static table:

$ pg_waldump 00000001000012FE00000001 |tail -200|egrep -i heap2
pg_waldump: fatal: error in WAL record at 12FE/F34F138: invalid resource manager ID 50 at 12FE/F34F168
rmgr: Heap2       len (rec/tot):     61/  3520, tx:          0, lsn: 12FE/0F346ED0, prev 12FE/0F346EA0, desc: CLEAN remxid 1983744188, blkref #0: rel 1663/88121/1083807 blk 617606 FPW
rmgr: Heap2       len (rec/tot):     66/    66, tx:          0, lsn: 12FE/0F34BC60, prev 12FE/0F34BC30, desc: CLEAN remxid 1984090598, blkref #0: rel 1663/88121/504681 blk 1447147

This matches our symptom pattern: no vacuum activity, but PRUNE is happening, leading into heap_xlog_cleanResolveRecoveryConflictWithSnapshot and the rest of the conflict machinery.

The PRUNE action producing rmgr: Heap2 CLEAN remxid WAL records will be demonstrated later via testing.

Let’s finish the source code analysis first.

ResolveRecoveryConflictWithSnapshot
#

void
ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode node)
{
	VirtualTransactionId *backends;

	/*
	 * If we get passed InvalidTransactionId then we do nothing (no conflict).
	 *
	 * This can happen when replaying already-applied WAL records after a
	 * standby crash or restart, or when replaying an XLOG_HEAP2_VISIBLE
	 * record that marks as frozen a page which was already all-visible.  It's
	 * also quite common with records generated during index deletion
	 * (original execution of the deletion can reason that a recovery conflict
	 * which is sufficient for the deletion operation must take place before
	 * replay of the deletion record itself).
	 */
	if (!TransactionIdIsValid(latestRemovedXid))
		return;

	backends = GetConflictingVirtualXIDs(latestRemovedXid,
										 node.dbNode);

	ResolveRecoveryConflictWithVirtualXIDs(backends,
										   PROCSIG_RECOVERY_CONFLICT_SNAPSHOT,
										   WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT,
										   true);
}

There are several types of query conflicts. ResolveRecoveryConflictWithSnapshot lives up to its name — it’s a snapshot conflict.

GetConflictingVirtualXIDs finds which backends conflict with the snapshot. ResolveRecoveryConflictWithVirtualXIDs handles the actual conflict resolution and timeout.

GetConflictingVirtualXIDs
#

GetConflictingVirtualXIDs is the key function that determines whether a backend’s virtual transaction ID triggers a query conflict. It requires a bit of brainpower.

Prerequisite knowledge for understanding this function:

  • limitXmin is latestRemovedXid — the CLEAN remxid from WAL, the xid that needs to be cleaned up (I read remxid as “remove xid”). /*limitXmin is supplied as either latestRemovedXid, or InvalidTransactionId*/
  • PGPROC contains current process info: backend id, database id, lock info, and much more
  • PGXACT contains the transaction info for the snapshot held by the current process. It’s lighter — the key field is xmin, the lowest xid the current process considers still running
  • C’s || rule: if either operand is true (non-zero), the result is true (1)
  • TransactionIdIsValid means xid != 0 — 0 is meaningless

Key function GetConflictingVirtualXIDs explained:

VirtualTransactionId *
GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid)
{
	...
	for (index = 0; index < arrayP->numProcs; index++)   // iterate all local processes
	{
		int			pgprocno = arrayP->pgprocnos[index];
		PGPROC	   *proc = &allProcs[pgprocno];   // process's PGPROC
		PGXACT	   *pgxact = &allPgXact[pgprocno];  // process's PGXACT

		/* Exclude prepared transactions */
		if (proc->pid == 0)   // prepared transactions have no owning process — can't handle
			continue;

		if (!OidIsValid(dbOid) ||         // global tables have dbOid=0 which is invalid — satisfies condition
			proc->databaseId == dbOid)   // only process current database. Cross-db is different — no transaction conflict at all.
		{
			/* Fetch xmin just once - can't change on us, but good coding */
			TransactionId pxmin = UINT32_ACCESS_ONCE(pgxact->xmin);  // pgxact->xmin is the minimum xid of transactions held by this process. UINT32_ACCESS_ONCE is just for atomic access protection — the xmin logic is unchanged

			/*
			 * We ignore an invalid pxmin because this means that backend has
			 * no snapshot currently. We hold a Share lock to avoid contention
			 * with users taking snapshots.  That is not a problem because the
			 * current xmin is always at least one higher than the latest
			 * removed xid, so any new snapshot would never conflict with the
			 * test here.
			 */
			if (!TransactionIdIsValid(limitXmin) ||  // limitXmin=0 possible? At least latestRemovedXid can't be — I can't think of a scenario where WAL would log an invalid xid
				(TransactionIdIsValid(pxmin) && !TransactionIdFollows(pxmin, limitXmin)))   // TransactionIdIsValid(pxmin) is also not really needed. !TransactionIdFollows(pxmin, limitXmin) means pxmin <= limitXmin
			{
				VirtualTransactionId vxid;

				GET_VXID_FROM_PGPROC(vxid, *proc);
				if (VirtualTransactionIdIsValid(vxid))
					vxids[count++] = vxid;
			}
		}
	}

The critical line is !TransactionIdFollows(pxmin, limitXmin).

So the core logic for determining query conflicts is:

  1. The primary’s cleaned remxid >= the standby query’s snapshot-held minimum xid → conflict.
  2. Only kills queries in the current database; global system tables (no database) are killed indiscriminately.

This means: even if the pruned table on the primary has nothing to do with the table being queried on the standby, a conflict CAN occur!!!

In-Page Pruning
#

Now that the conflict logic is clear, we still need to understand where the WAL CLEAN records come from. That requires looking at how PRUNE is triggered.

From README.HOT on when pruning and defragmentation occur — “When can/should we prune or defragment?”:

The currently planned heuristic is to prune and defrag when first accessing a page that potentially has prunable tuples

Prune and defragment are indeed two distinct concepts, but they often happen together.

  • Prune: updating line pointers to shorten HOT chains, but doesn’t free space
  • Defragment: reclaiming space from dead line pointers and tuples after pruning

We cannot prune or defragment unless we can get a “buffer cleanup lock” on the target page; otherwise, pruning might destroy line pointers that other backends have live references to, and defragmenting might move tuples that other backends have live pointers to

The page must be under a “buffer cleanup lock” for prune or defragment to occur.

The worst-case consequence of this is only that an UPDATE cannot be made HOT but has to link to a new tuple version placed on some other page, for lack of centralized space on the original page.

A typical scenario: a HOT update spills to another page (easy to test).

space reclamation happens during tuple retrieval when the page is nearly full (<10% free) and a buffer cleanup lock can be acquired. This means that UPDATE, DELETE, and SELECT can trigger space reclamation, but often not during INSERT … VALUES because it does not retrieve a row.

SELECT/UPDATE/DELETE that scan rows can trigger space reclamation. INSERT typically won’t, since it doesn’t retrieve rows.

Clearly, after prune or defragment, the corresponding xids should be reclaimed. From the README we can see that HOT updates can reproduce prune/defragment, generating CLEAN WAL records. See [Test: Pure UPDATE Produces In-Page Pruning](## Test: Pure UPDATE Produces In-Page Pruning).

Testing
#

The tests below only observe whether conflicts occur, whether CLEAN WAL records appear, or whether page line pointers are updated — without distinguishing prune vs. defragment. In many cases both are triggered together; distinguishing them is tedious and maybe best left for later. The focus here is whether CLEAN WAL records appear.

Helper SQL:

--sql for test
--heap_page_items
select t_ctid,lp,
case lp_flags
when 0 then '0:LP_UNUSED'
when 1 then 'LP_NORMAL'
when 2 then 'LP_REDIRECT'
when 3 then 'LP_DEAD'
end as lp_flags,
t_xmin,t_xmax,t_field3 as t_cid, raw_flags, info.combined_flags,substring(t_data,0,40)
from heap_page_items(get_raw_page('lzl',0)) item,
LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2) info
order by lp;

--heap header
select * from page_header(get_raw_page('lzl',0));

--bt_page_items
SELECT itemoffset, ctid, itemlen, nulls, vars, dead, htid, tids[0:2] AS some_tids FROM bt_page_items('idxlzl',1);

--create table
create table lzl(a char(2000));
create index idxlzl on lzl(a);
insert into lzl values('z');
update lzl set a=md5(random()::text);  -- non-hot
update lzl set a='z';  -- hot

--force index scan
set enable_seqscan =off;
set enable_indexonlyscan=off;

--open an RR transaction to hold a snapshot for observation
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;

Test: Cross-Table Query Conflict
#

primarystandby
create table lzl(a bigint primary key);
insert into lzl values(1);
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
select 1;
update lzl set a=2;
no blocking
vacuum lzl;
#3 ResolveRecoveryConflictWithVirtualXIDs (waitlist=0x277c340, reason=reason@entry=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT, wait_event_info=wait_event_info@entry=134217762, report_waiting=report_waiting@entry=true) at standby.c:276
#4 0x0000000000787b33 in ResolveRecoveryConflictWithVirtualXIDs (report_waiting=true, wait_event_info=134217762, reason=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT, waitlist=) at standby.c:333
#5 ResolveRecoveryConflictWithSnapshot (latestRemovedXid=, node=…) at standby.c:329
#6 0x00000000004c8ffe in heap_xlog_clean (record=0x273a258) at heapam.c:7764

Conclusion: As long as a query exists, it has a snapshot, and a snapshot has a snapshot xmin. Even if the queried table is completely unrelated, a query conflict CAN occur.

Test: Vacuum Produces In-Page Pruning
#

Pruning occurs, conflicts occur. Example omitted — not relevant to this case.

Test: UPDATE Produces In-Page Pruning
#

--HOT, off-page update triggers defragment
--An 8k heap page stores 4-2xx rows. Here we size rows so 4 fit and remain HOT — the next update spills off-page
create table lzl(a char(2000));
create table idxlzl on lzl(a);
insert into lzl values('z');
update lzl set a='z';  --hot
update lzl set a='z';  --hot
update lzl set a='z';  --hot

--heap page: 4 rows, all HOT:
 t_ctid | lp | lp_flags  |  t_xmin  |  t_xmax  | t_cid |                                                raw_flags                                                 | combined_flags |
--------+----+-----------+----------+----------+-------+----------------------------------------------------------------------------------------------------------+----------------+----------
 (0,2)  |  1 | LP_NORMAL | 34954161 | 34954162 |     0 | {HEAP_HASVARWIDTH,HEAP_XMIN_COMMITTED,HEAP_XMAX_COMMITTED,HEAP_HOT_UPDATED}                              | {}             | \x501f000
 (0,3)  |  2 | LP_NORMAL | 34954162 | 34954163 |     0 | {HEAP_HASVARWIDTH,HEAP_XMIN_COMMITTED,HEAP_XMAX_COMMITTED,HEAP_UPDATED,HEAP_HOT_UPDATED,HEAP_ONLY_TUPLE} | {}             | \x501f000
 (0,4)  |  3 | LP_NORMAL | 34954163 | 34954164 |     0 | {HEAP_HASVARWIDTH,HEAP_XMIN_COMMITTED,HEAP_UPDATED,HEAP_HOT_UPDATED,HEAP_ONLY_TUPLE}                     | {}             | \x501f000
 (0,4)  |  4 | LP_NORMAL | 34954164 |        0 |     0 | {HEAP_HASVARWIDTH,HEAP_XMAX_INVALID,HEAP_UPDATED,HEAP_ONLY_TUPLE}                                        | {}             | \x501f000
(4 rows)

--index: only one entry:
 itemoffset | ctid  | itemlen | nulls | vars | dead | htid  | some_tids
------------+-------+---------+-------+------+------+-------+-----------
          1 | (0,1) |      48 | f     | t    | f    | (0,1) | [null]
--One more update triggers off-page update
update lzl set a='z';  --page full, can't HOT

--HOT chain changed. LP changed
 t_ctid | lp |  lp_flags   |  t_xmin  |  t_xmax  | t_cid  |                                      raw_flags                                       | combined_flags |
--------+----+-------------+----------+----------+--------+--------------------------------------------------------------------------------------+----------------+---------------------------
 [null] |  1 | LP_REDIRECT |   [null] |   [null] | [null] | [null]                                                                               | [null]         | [null]
 (0,2)  |  2 | LP_NORMAL   | 34954165 |        0 |      0 | {HEAP_HASVARWIDTH,HEAP_XMAX_INVALID,HEAP_UPDATED,HEAP_ONLY_TUPLE}                    | {}             | \x501f00007a20202020202020
 [null] |  3 | 0:LP_UNUSED |   [null] |   [null] | [null] | [null]                                                                               | [null]         | [null]
 (0,2)  |  4 | LP_NORMAL   | 34954164 | 34954165 |      0 | {HEAP_HASVARWIDTH,HEAP_XMIN_COMMITTED,HEAP_UPDATED,HEAP_HOT_UPDATED,HEAP_ONLY_TUPLE} | {}             | \x501f00007a20202020202020
(4 rows)

--index: still only one entry, unchanged:
 itemoffset | ctid  | itemlen | nulls | vars | dead | htid  | some_tids
------------+-------+---------+-------+------+------+-------+-----------
          1 | (0,1) |      48 | f     | t    | f    | (0,1) | [null]

The next update doesn’t go to a new page — instead, in-page pruning happens first, freeing space on the same page, so the row is written locally. This saves a page access.

WAL produces CLEAN remxid, confirming that a query conflict can occur:

rmgr: Heap2       len (rec/tot):     62/    62, tx:          0, lsn: 3DB/F8017348, prev 3DB/F8017310, desc: CLEAN remxid 34954177, blkref #0: rel 1663/5893914/5893920 blk 0
rmgr: Heap        len (rec/tot):   2070/  2070, tx:   34954178, lsn: 3DB/F8017388, prev 3DB/F8017348, desc: HOT_UPDATE off 4 xmax 34954178 flags 0x10 ; new off 2 xmax 0, blkref #0: rel 1663/5893914/5893920 blk 0

Conclusion: UPDATE statements can produce in-page pruning and can cause query conflicts.

Test: Hint-Bit Writeback Producing In-Page Pruning?
#

primarystandby
wal_log_hints=on
truncate table lzl;
insert into lzl values(‘z’);
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
select * from lzl;
delete from lzl where a=‘z’;
checkpoint;
select * from lzl;
–WAL contains FPI_FOR_HINT
–no query conflict

Standby pageinspect:

 t_ctid | lp | lp_flags  |  t_xmin  |  t_xmax  | t_cid |                                  raw_flags                                   | combined_flags |                                    substring
--------+----+-----------+----------+----------+-------+------------------------------------------------------------------------------+----------------+-------------------------------------------------
 (0,1)  |  1 | LP_NORMAL | 34954229 | 34954230 |     0 | {HEAP_HASVARWIDTH,HEAP_XMIN_COMMITTED,HEAP_XMAX_COMMITTED,HEAP_KEYS_UPDATED} | {}             | \x501f00007a202020202020202020202020202020202020
(1 row)

Conclusion: WAL log hints only sync hint bits and don’t affect xmin/xmax. No CLEAN or similar records are produced, so hint-bit writeback does NOT cause query conflicts.

Test: SELECT Produces In-Page Pruning
#

SELECT normally doesn’t cause pruning, but it does when the page is nearly full: https://www.modb.pro/db/1683648157451362304

Testing pruning on a full page:

-- Same table as before, 4 HOT rows, nearly full
insert into lzl values('z');
update lzl set a='z';
update lzl set a='z';
update lzl set a='z';

--page at this point:
 t_ctid | lp | lp_flags  |  t_xmin  |  t_xmax  | t_cid |                                                raw_flags                                                 | combined_flags |
--------+----+-----------+----------+----------+-------+----------------------------------------------------------------------------------------------------------+----------------+---------------------
 (0,2)  |  1 | LP_NORMAL | 34954232 | 34954233 |     0 | {HEAP_HASVARWIDTH,HEAP_XMIN_COMMITTED,HEAP_XMAX_COMMITTED,HEAP_HOT_UPDATED}                              | {}             | \x501f00007a20202020
 (0,3)  |  2 | LP_NORMAL | 34954233 | 34954234 |     0 | {HEAP_HASVARWIDTH,HEAP_XMIN_COMMITTED,HEAP_XMAX_COMMITTED,HEAP_UPDATED,HEAP_HOT_UPDATED,HEAP_ONLY_TUPLE} | {}             | \x501f00007a20202020
 (0,4)  |  3 | LP_NORMAL | 34954234 | 34954235 |     0 | {HEAP_HASVARWIDTH,HEAP_XMIN_COMMITTED,HEAP_UPDATED,HEAP_HOT_UPDATED,HEAP_ONLY_TUPLE}                     | {}             | \x501f00007a20202020
 (0,4)  |  4 | LP_NORMAL | 34954235 |        0 |     0 | {HEAP_HASVARWIDTH,HEAP_XMAX_INVALID,HEAP_UPDATED,HEAP_ONLY_TUPLE}                                        | {}             | \x501f00007a20202020
(4 rows)

-- A SELECT
select * from lzl;

--page now shows in-page pruning:
 t_ctid | lp |  lp_flags   |  t_xmin  | t_xmax | t_cid  |                                       raw_flags                                       | combined_flags |                                    sub
--------+----+-------------+----------+--------+--------+---------------------------------------------------------------------------------------+----------------+---------------------------------------
 [null] |  1 | LP_REDIRECT |   [null] | [null] | [null] | [null]                                                                                | [null]         | [null]
 [null] |  2 | 0:LP_UNUSED |   [null] | [null] | [null] | [null]                                                                                | [null]         | [null]
 [null] |  3 | 0:LP_UNUSED |   [null] | [null] | [null] | [null]                                                                                | [null]         | [null]
 (0,4)  |  4 | LP_NORMAL   | 34954235 |      0 |      0 | {HEAP_HASVARWIDTH,HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID,HEAP_UPDATED,HEAP_ONLY_TUPLE} | {}             | \x501f00007a20202020202020202020202020

Conclusion: SELECT can produce in-page pruning and can cause query conflicts.

Test: Shared Table Cross-Database Query Conflict
#

Shared tables are global. Earlier in GetConflictingVirtualXIDs we saw that global tables are killed indiscriminately. Let’s test.

Shared table info:

Source definition: IsSharedRelation
Source check: shared ? InvalidOid : MyDatabaseId;
Table: pg_class.relisshared
Directory: global/

Querying pg_class.relisshared directly is easier:

select relname,relkind,relisshared from pg_class where relisshared is true and relkind='r';
        relname        | relkind | relisshared
-----------------------+---------+-------------
 pg_authid             | r       | t
 pg_subscription       | r       | t
 pg_database           | r       | t
 pg_db_role_setting    | r       | t
 pg_tablespace         | r       | t
 pg_auth_members       | r       | t
 pg_shdepend           | r       | t
 pg_shdescription      | r       | t
 pg_replication_origin | r       | t
 pg_shseclabel         | r       | t

pg_authid stores role/user info. Testing with a password change:

--Test: on the primary, in a non-business database
create user lzl;
alter user lzl with password '1';  --run several times

CLEAN remxid appears:

rmgr: Heap        len (rec/tot):     76/    76, tx:   34954264, lsn: 3DB/F808D0F8, prev 3DB/F808D0B8, desc: HOT_UPDATE off 67 xmax 34954264 flags 0x20 ; new off 66 xmax 0, blkref #0: rel 1664/0/1260 blk 0
rmgr: Transaction len (rec/tot):     82/    82, tx:   34954264, lsn: 3DB/F808D148, prev 3DB/F808D0F8, desc: COMMIT 2025-09-12 14:40:56.680782 CST; inval msgs: catcache 11 catcache 10
rmgr: Heap2       len (rec/tot):     60/    60, tx:          0, lsn: 3DB/F808D1A0, prev 3DB/F808D148, desc: CLEAN remxid 34954264, blkref #0: rel 1664/0/1260 blk 0
rmgr: Heap2       len (rec/tot):     60/    60, tx:   34954265, lsn: 3DB/F808D1E0,

The standby business database’s select 1 query was killed.

Conclusion: Shared tables can cause cross-database query conflicts.

That said, these shared system tables rarely see heavy updates in normal operations.

Conclusions
#

Developer Perspective
#

Query conflicts can be completely unrelated to the table being queried — meaning a fully static table CAN experience conflicts.

Cross-database means different business domains and data. Cross-database does NOT cause query conflicts. The one exception is shared tables, but these are just a handful of system tables that rarely see updates.

For developers, focus on:

  • Retry on failure: Standby queries can be killed — retrying is essential, and retries may succeed
  • Query duration: Longer queries are more likely to be killed
  • Alternative standbys: Consider using a different standby with lower disaster-recovery requirements

Operations Perspective
#

Since query conflicts can come from “all directions,” a simple long-running single-table query can be killed by in-page pruning on a completely different, frequently-updated table. You can increase max_standby_streaming_delay to reduce conflict probability.

However, max_standby_streaming_delay trades off against WAL apply — a longer delay means WAL application is paused. This parameter’s value directly represents the maximum possible standby replication lag (it can’t cap lag from network or other factors).

  • Query freshness: Prolonged WAL apply pauses mean the standby data lags significantly (WAL may already be on the standby’s disk), affecting data freshness requirements for other standby queries.
  • RTO: If the primary suffers a disaster and failover is needed, the standby must apply accumulated WAL. If apply delay stretches to hours, it may violate minute-level RTO SLAs.

So tuning max_standby_streaming_delay is a delicate exercise requiring consideration of the standby’s role, query freshness requirements, and even geography.

Related

Linux Memory Advanced

·11070 words·52 mins
(For memory basics, refer to Linux Memory Analysis; this article covers memory knowledge above that foundation) Memory Basic Concepts # buddy # The process of buddy system allocating and merging pages is omitted. Easily overlooked knowledge points: The prerequisite for buddy merging two blocks of the same size is that their physical addresses are contiguous The merge algorithm is iterative: after merging at the current level, it will automatically attempt to merge larger blocks. This means compactd is not strictly required for merging page table & PTE # page table and PTE are actually two different concepts, and they are easily confused because both generally refer to page tables. Below is relevant knowledge about page table and PTE[^ 《深入理解Linux内核》 (Understanding the Linux Kernel)]

PostgreSQL CLOG Files and Standby Synchronization Analysis

·3742 words·18 mins
Among all relational databases, PostgreSQL’s CLOG is a very special type of log. CLOG’s existence is inseparable from PostgreSQL’s MVCC mechanism. Some basic knowledge about transaction IDs and CLOG won’t be covered in this article. If interested, please refer to CLOG and Hint Bits. This article focuses on the structure of CLOG files, manually locating transaction states, and the CLOG WAL log synchronization mechanism, to further understand PostgreSQL’s CLOG. CLOG Segment # CLOG Directory # To distinguish from regular logs, PostgreSQL 10 renamed the CLOG and WAL directories 1: