Skip to main content
  1. Posts/

When Does VACUUM Truncate Empty Pages at the End of a Table?

·1534 words·8 mins
liuzhilong62
Author
liuzhilong62
PostgreSQL DBA. Writing about database internals, production cases, and source code analysis.

VACUUM Truncate
#

TRUNCATE—Specifies that VACUUM should attempt to truncate off any empty pages at the end of the table and allow the disk space for the truncated pages to be returned to the operating system. This is normally the desired behavior and is the default unless the vacuum_truncate option has been set to false for the table to be vacuumed. Setting this option to false may be useful to avoid ACCESS EXCLUSIVE lock on the table that the truncation requires. This option is ignored if the FULL option is used.

AKA, the truncate option in VACUUM is enabled by default. It removes empty pages at the end of the table, acquiring an AccessExclusiveLock (level 8 lock) on the table during the operation.

Today I found that in a certain environment, after deleting all data with DELETE FROM, neither autovacuum nor manual VACUUM reclaimed the space.

Reproducing the issue:

create table lzl1(a int);
insert into lzl1 select generate_series(1,1000) a;
analyze lzl1;
lzldb=#  select relname,relpages,reltuples from   pg_class where relname='lzl1';
 relname | relpages | reltuples
---------+----------+-----------
 lzl1    |        5 |      1000

relpages is 5, so the last page number is 4.

lzldb=> select t_ctid,lp,case lp_flags when 0 then '0:LP_UNUSED' when 1 then 'LP_NORMAL' when 2 then 'LP_REDIRECT'  when 3 then 'LP_DEAD' end as lp_flags,t_xmin,t_xmax,t_field3 as t_cid, raw_flags, info.combined_flags,substring(t_data,0,40) from heap_page_items(get_raw_page('lzl1',4)) item,LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2) info order by lp;
 t_ctid | lp | lp_flags  | t_xmin | t_xmax | t_cid |                raw_flags                | combined_flags | substring
--------+----+-----------+--------+--------+-------+-----------------------------------------+----------------+------------
 (4,1)  |  1 | LP_NORMAL |    772 |      0 |     0 | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}             | \x89030000
 (4,2)  |  2 | LP_NORMAL |    772 |      0 |     0 | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}             | \x8a030000
 ...
delete from lzl1;
vacuum lzl1;
lzldb=> select t_ctid,lp,case lp_flags when 0 then '0:LP_UNUSED' when 1 then 'LP_NORMAL' when 2 then 'LP_REDIRECT'  when 3 then 'LP_DEAD' end as lp_flags,t_xmin,t_xmax,t_field3 as t_cid, raw_flags, info.combined_flags,substring(t_data,0,40) from heap_page_items(get_raw_page('lzl1',4)) item,LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2) info order by lp;
 t_ctid | lp |  lp_flags   | t_xmin | t_xmax | t_cid | raw_flags | combined_flags | substring
--------+----+-------------+--------+--------+-------+-----------+----------------+-----------
        |  1 | 0:LP_UNUSED |        |        |       |           |                |
        lzldb=# select relname,relpages,reltuples from   pg_class where relname='lzl1';
 relname | relpages | reltuples
---------+----------+-----------
 lzl2    |        5 |         0

It looks like all dead tuples were reclaimed, but the space is still occupied — the pages were not freed. Why doesn’t it truncate when the table is completely empty? Let’s dig into this question.

Source Code Analysis of should_attempt_truncation
#

(Unless otherwise noted, the version referenced is PG 11.)

In vacuumlazy.c there’s a pithily named function should_attempt_truncation — this is the function that decides whether truncation is needed:

static bool
should_attempt_truncation(LVRelStats *vacrelstats)
{
	BlockNumber possibly_freeable;

	possibly_freeable = vacrelstats->rel_pages - vacrelstats->nonempty_pages;
	if (possibly_freeable > 0 &&
		(possibly_freeable >= REL_TRUNCATE_MINIMUM ||
		 possibly_freeable >= vacrelstats->rel_pages / REL_TRUNCATE_FRACTION) &&
		old_snapshot_threshold < 0)
		return true;
	else
		return false;
}

Where:

#define REL_TRUNCATE_MINIMUM 1000
#define REL_TRUNCATE_FRACTION 16

So the conditions for truncation are:

  • Number of empty trailing pages > 1000, or number of empty trailing pages > 1/16 of total pages
  • old_snapshot_threshold < 0

The first rule exists to avoid constantly truncating tiny bits of trailing empty pages — reclaiming that negligible space isn’t worth the time and the AccessExclusiveLock. It’s unnecessary.

The second rule is explained as follows:

 * Also don't attempt it if we are doing early pruning/vacuuming, because a
 * scan which cannot find a truncated heap page cannot determine that the
 * snapshot is too old to read that page.  We might be able to get away with
 * truncating all except one of the pages, setting its LSN to (at least) the
 * maximum of the truncated range if we also treated an index leaf tuple
 * pointing to a missing heap page as something to trigger the "snapshot too
 * old" error, but that seems fragile and seems like it deserves its own patch
 * if we consider it.

“Because VACUUM scanning cannot yet confirm whether page data has snapshot-too-old issues, and there are LSN and index page complications, the code logic looks fiddly. If this feature is needed, a dedicated patch would be required.”

OK, so it looks like the code simply doesn’t check whether a page actually has snapshot-too-old issues. It takes the blunt approach of checking old_snapshot_threshold < 0 — the database itself must have snapshot-too-old disabled before truncation is attempted.

Going back to the earlier problem where VACUUM didn’t reclaim space: since DELETE removed all data, the condition “empty trailing pages > 1/16 of total pages” was definitely satisfied. However, old_snapshot_threshold was actually enabled in that environment:

lzldb=> show  old_snapshot_threshold ;
 old_snapshot_threshold
------------------------
 1h

Disabling old_snapshot_threshold and then doing the delete-all + VACUUM will reclaim the space. Disabling old_snapshot_threshold requires a database restart.

-- After restart
lzldb=> show  old_snapshot_threshold ;
 old_snapshot_threshold
------------------------
 -1
lzldb=> select pg_relation_filepath('lzl1');
 pg_relation_filepath
----------------------
 base/16384/16446

lzldb=> vacuum lzl1;
-- Pages successfully reclaimed
lzldb=#  select relname,relpages,reltuples from   pg_class where relname='lzl1';
 relname | relpages | reltuples
---------+----------+-----------
 lzl1    |        0 |         0
-- Table not rebuilt
lzldb=>  select pg_relation_filepath('lzl1');
 pg_relation_filepath
----------------------
 base/16384/16446

All pages successfully reclaimed, table not rebuilt. Problem located.

But to understand the VACUUM truncation mechanism more deeply, let’s continue to the next section.

Source Code Analysis of lazy_truncate_heap
#

Relying solely on should_attempt_truncation to judge truncation isn’t rigorous enough. We also need to look at lazy_truncate_heap, the function that actually performs truncation, which has additional checks:

/*
 * lazy_truncate_heap - try to truncate off any empty pages at the end
 */
static void
lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
{
	BlockNumber old_rel_pages = vacrelstats->rel_pages;
	BlockNumber new_rel_pages;
	int			lock_retry;

	/* Report that we are now truncating */
	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
								 PROGRESS_VACUUM_PHASE_TRUNCATE);

	/*
	 * Loop until no more truncating can be done.
	 */
	do
	{
		PGRUsage	ru0;

		pg_rusage_init(&ru0);

		/*
		 * We need full exclusive lock on the relation in order to do
		 * truncation. If we can't get it, give up rather than waiting --- we
		 * don't want to block other backends, and we don't want to deadlock
		 * (which is quite possible considering we already hold a lower-grade
		 * lock).
		 */
		Vacrelstats->lock_waiter_detected = false;
		lock_retry = 0;
		while (true)
		{
			// If we can acquire the lock, break out of while
			if (ConditionalLockRelation(onerel, AccessExclusiveLock))
				break;

			/*
			 * Check for interrupts while trying to (re-)acquire the exclusive
			 * lock.
			 */
			CHECK_FOR_INTERRUPTS();

			// If lock not immediately acquired, initially (++lock_retry)=1, <=100;
			// when >100, give up truncation and return
			if (++lock_retry > (VACUUM_TRUNCATE_LOCK_TIMEOUT /
								VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL))
			{
				/*
				 * We failed to establish the lock in the specified number of
				 * retries. This means we give up truncating.
				 */
				Vacrelstats->lock_waiter_detected = true;
				ereport(elevel,
						(errmsg("\"%s\": stopping truncate due to conflicting lock request",
								RelationGetRelationName(onerel))));
				return;
			}

			// Sleep 50ms. Looks a bit crude. Theoretical max wait: 50*100=5s
			pg_usleep(VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL * 1000L);
		}

		// After acquiring the exclusive lock, check if new tuples arrived during VACUUM.
		// If so, don't truncate.
		new_rel_pages = RelationGetNumberOfBlocks(onerel);
		if (new_rel_pages != old_rel_pages)
		{
			UnlockRelation(onerel, AccessExclusiveLock);
			return;
		}

		new_rel_pages = count_nondeletable_pages(onerel, vacrelstats);
		// If new tuples were written during VACUUM, don't truncate
		if (new_rel_pages >= old_rel_pages)
		{
			/* can't do anything after all */
			UnlockRelation(onerel, AccessExclusiveLock);
			return;
		}

		/*
		 * Okay to truncate.
		 */
		RelationTruncate(onerel, new_rel_pages);

		// Release lock immediately after truncation
		UnlockRelation(onerel, AccessExclusiveLock);

...
	} while (new_rel_pages > vacrelstats->nonempty_pages &&
			 vacrelstats->lock_waiter_detected);
}

Where:

#define VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL 50 /* microseconds!! */
#define VACUUM_TRUNCATE_LOCK_TIMEOUT 5000 /* microseconds!! */

The main function actually called is RelationTruncate. The bulk of the preceding code is all about trying to acquire the AccessExclusiveLock. Beyond the two conditions mentioned earlier, truncation also won’t happen in these two cases:

  • Failed to acquire AccessExclusiveLock
  • New data was written during the VACUUM

VACUUM Truncate May Wait Up to 5 Seconds
#

While reading the lazy_truncate_heap source code above, I noticed the lock acquisition retry loop has a somewhat crude wait:

pg_usleep(VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL * 1000L);

Each loop iteration sleeps 50ms. The theoretical maximum wait is 50×100 = 5 seconds!

Let’s test this wait time:

Window 1Window 2
create table lzl2;
alter table lzl2 set (autovacuum_enabled=off);;
insert into lzl2 select generate_series(1,1000) a;
delete from lzl2;
begin;
select * from lzl2;
\timing
vacuum lzl2; – Time: 5022.122 ms (00:05.022)

We can see the wait time is about 5 seconds.

If you’re fast enough, you can open a third window and grab a pstack of session 2:

[postgres@cncq081298 lzl]$ pstack 4113
#0  0x00002b92a978c013 in __select_nocancel () from /lib64/libc.so.6
#1  0x000000000086225a in pg_usleep (microsec=microsec@entry=50000) at pgsleep.c:56
#2  0x00000000005e8212 in lazy_truncate_heap (vacrelstats=0xfc4490, onerel=0x2b92a8bc88d8) at vacuumlazy.c:1861
#3  lazy_vacuum_rel (onerel=onerel@entry=0x2b92a8bc88d8, options=options@entry=5, params=params@entry=0x7ffc96bb31d0, bstrategy=<optimized out>) at vacuumlazy.c:290
#4  0x00000000005e4551 in vacuum_rel (relid=32778, relation=<optimized out>, options=options@entry=5, params=params@entry=0x7ffc96bb31d0) at vacuum.c:1572
#5  0x00000000005e55ac in vacuum (options=5, relations=0xfc6540, params=params@entry=0x7ffc96bb31d0, bstrategy=<optimized out>, bstrategy@entry=0x0, isTopLevel=isTopLevel@entry=true) at vacuum.c:340
...

It reached pg_usleep inside lazy_truncate_heap, passing entry=50000 microsec. In reality, pg_usleep looped 100 times, total wait time 50000×100 microseconds = 5 seconds.

Later, in PG 15, this code was improved by replacing pg_usleep with WaitLatch:

(void) WaitLatch(MyLatch,
WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL,
WAIT_EVENT_VACUUM_TRUNCATE);
ResetLatch(MyLatch);

VACUUM Truncate Summary
#

Conditions for VACUUM to trigger truncation (all must be met):

  • Empty trailing pages > 1000, or empty trailing pages > 1/16 of total pages
  • old_snapshot_threshold < 0
  • Before PG 15 (exclusive): must acquire AccessExclusiveLock within 5 seconds
  • No new data written during the VACUUM

This article was originally published in Chinese on lastdba.com.

Related

A Brief Analysis of PostgreSQL TRUNCATE

·2240 words·11 mins
Command Options # TRUNCATE [ TABLE ] [ ONLY ] name [ * ] [, ... ] [ RESTART IDENTITY | CONTINUE IDENTITY ] [ CASCADE | RESTRICT ] 1. ONLY: truncate only the specified table. When a table has inheritance children or child partitions, by default they are truncated together; ONLY can truncate just the inheritance parent table. Partitioned parent tables cannot specify ONLY. -- Cannot truncate only a partitioned parent table => truncate only parttable; ERROR: 42809: cannot truncate only a partitioned table HINT: Do not specify the ONLY keyword, or use TRUNCATE ONLY on the partitions directly. LOCATION: ExecuteTruncate, tablecmds.c:1655 -- truncate only the inheritance parent table, only the parent is cleaned => truncate table only parenttable; TRUNCATE TABLE => select tableoid::regclass,count(*) from parenttable group by tableoid::regclass ; tableoid | count ------------+------- childtable | 1 -- Directly truncate the inheritance parent table, child tables are also cleaned => truncate table parenttable; TRUNCATE TABLE => select tableoid::regclass,count(*) from parenttable group by tableoid::regclass ; tableoid | count ----------+------- (0 rows) 2. RESTART IDENTITY CONTINUE IDENTITY: whether to reset sequences on columns. Default is CONTINUE.

From Extremely Slow Unique Index Scan to Index Bloat

·6561 words·31 mins
How Did a Primary Key Query Access Multiple Data Pages? # Continuing from the previous article: A Classic Case of Long Transactions, Table Bloat, and LIMIT Problems, there was one point not explained in detail: Why does a query using the primary key generate so many shared hits? Why does index bloat cause access to multiple data pages? Can’t data outside the page be located through the corresponding index entry? This relates to index version management — indexes do carry some version information, but not much. Let’s first review PostgreSQL’s btree index structure.

How Does PG Access Basic System Tables Before pg_class Exists?

·1815 words·9 mins
How does the database access system tables before pg_class exists? This question can be divided into two stages: Database cluster initialization — at this point no database exists at all, so how to construct and access system tables like pg_class is a problem. Private memory initialization of system tables. PG stores system table information in the local backend process. How does the backend load pg_class during initialization? Initializing the Data Dictionary # When the database hasn’t been initialized yet, it’s obviously impossible to access the data dictionary to initialize objects like database, pg_class, etc., because without a database you can’t CREATE DATABASE, and without pg_class you can’t look up metadata information.