跳过正文
  1. 文章/

vacuum啥时候截断末尾页?

·2659 字·6 分钟
liuzhilong62
作者
liuzhilong62
PostgreSQL DBA,关注数据库内核、案例分析、源码解读
C M

vacuum truncate
#

TRUNCATE—Specifies that VACUUM should attempt to truncate off any empty pages at the end of the table and allow the disk space for the truncated pages to be returned to the operating system. This is normally the desired behavior and is the default unless the vacuum_truncate option has been set to false for the table to be vacuumed. Setting this option to false may be useful to avoid ACCESS EXCLUSIVE lock on the table that the truncation requires. This option is ignored if the FULL option is used.

AKA,vacuum操作中的truncate选项是默认打开的,它会把表末尾的pages移除,移除的时候会获得一个表上的8级锁。 今天发现某个环境上delete from删除完所有数据后,autovacuum自动触发或者手动跑vacuum,都不会回收空间。 重现问题:

create table lzl1(a int);
insert into lzl1 select generate_series(1,1000) a;
analyze lzl1;
lzldb=#  select relname,relpages,reltuples from   pg_class where relname='lzl1';
 relname | relpages | reltuples 
---------+----------+-----------
 lzl1    |        5 |      1000

在relpage是5,最后一个页的页号是4

lzldb=> select t_ctid,lp,case lp_flags when 0 then '0:LP_UNUSED' when 1 then 'LP_NORMAL' when 2 then 'LP_REDIRECT'  when 3 then 'LP_DEAD' end as lp_flags,t_xmin,t_xmax,t_field3 as t_cid, raw_flags, info.combined_flags,substring(t_data,0,40) from heap_page_items(get_raw_page('lzl1',4)) item,LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2) info order by lp;
 t_ctid | lp | lp_flags  | t_xmin | t_xmax | t_cid |                raw_flags                | combined_flags | substring  
--------+----+-----------+--------+--------+-------+-----------------------------------------+----------------+------------
 (4,1)  |  1 | LP_NORMAL |    772 |      0 |     0 | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}             | \x89030000
 (4,2)  |  2 | LP_NORMAL |    772 |      0 |     0 | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}             | \x8a030000
 ...
delete from lzl1;
vacuum lzl1;
lzldb=> select t_ctid,lp,case lp_flags when 0 then '0:LP_UNUSED' when 1 then 'LP_NORMAL' when 2 then 'LP_REDIRECT'  when 3 then 'LP_DEAD' end as lp_flags,t_xmin,t_xmax,t_field3 as t_cid, raw_flags, info.combined_flags,substring(t_data,0,40) from heap_page_items(get_raw_page('lzl1',4)) item,LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2) info order by lp;
 t_ctid | lp |  lp_flags   | t_xmin | t_xmax | t_cid | raw_flags | combined_flags | substring 
--------+----+-------------+--------+--------+-------+-----------+----------------+-----------
        |  1 | 0:LP_UNUSED |        |        |       |           |                | 
        lzldb=# select relname,relpages,reltuples from   pg_class where relname='lzl1';
 relname | relpages | reltuples 
---------+----------+-----------
 lzl2    |        5 |         0

看上去死元组都回收了,不过空间还是占用着的,page并没有回收 为什么表里面都没有数据了还不做截断操作?带着这个问题来一探究竟。

should_attempt_truncation函数源码分析
#

(未声明版本的都是PG11) 在vacuumlazy.c中有一个言简意赅的函数should_attempt_truncation,这个就是判断是否需要做truncation的函数:

static bool
should_attempt_truncation(LVRelStats *vacrelstats)
{
	BlockNumber possibly_freeable;

	possibly_freeable = vacrelstats->rel_pages - vacrelstats->nonempty_pages;
	if (possibly_freeable > 0 &&
		(possibly_freeable >= REL_TRUNCATE_MINIMUM ||
		 possibly_freeable >= vacrelstats->rel_pages / REL_TRUNCATE_FRACTION) &&
		old_snapshot_threshold < 0)
		return true;
	else
		return false;
}

其中

#define REL_TRUNCATE_MINIMUM 1000
#define REL_TRUNCATE_FRACTION 16

所以是否truncation的判断逻辑必须满足以下条件:

  • 末尾空page数大于1000 或 末尾空page数大于总page数的1/16
  • old_snapshot_threshold<0

第一个规则是为了不用老是去截断那先零零碎碎的末尾空page,这回收不了多少空间,不仅浪费时间还浪费8级锁,没有必要。 第二个规则是这样解释的:

 * Also don't attempt it if we are doing early pruning/vacuuming, because a
 * scan which cannot find a truncated heap page cannot determine that the
 * snapshot is too old to read that page.  We might be able to get away with
 * truncating all except one of the pages, setting its LSN to (at least) the
 * maximum of the truncated range if we also treated an index leaf tuple
 * pointing to a missing heap page as something to trigger the "snapshot too
 * old" error, but that seems fragile and seems like it deserves its own patch
 * if we consider it.

“因为vacuum扫描还不能确认page上的数据是否有快照过久问题,里面还有些LSN、索引页问题,代码逻辑看起来很琐碎,如果需要这个功能的话,需要一个patch来解决这个问题” OK,看起来就是没有去判断page是否真的有快照过久问题,那么简单粗暴地以old_snapshot_threshold<0来判断,数据库本身关闭了快照过久问题,才会去truncation。

回到之前的vacuum没有回收的问题,因为是delete删除了所有数据,所以肯定满足“末尾空page数大于总page数的1/16”,然而环境中的old_snapshot_threshold确是打开的:

lzldb=> show  old_snapshot_threshold ;
 old_snapshot_threshold 
------------------------
 1h

关闭old_snapshot_threshold再去做这个delete全表+vacuum就会回收了。关闭old_snapshot_threshold需要重启数据库。

--重启后
lzldb=> show  old_snapshot_threshold ;
 old_snapshot_threshold 
------------------------
 -1
lzldb=> select pg_relation_filepath('lzl1');
 pg_relation_filepath 
----------------------
 base/16384/16446

lzldb=> vacuum lzl1; 
--pages成功回收
lzldb=#  select relname,relpages,reltuples from   pg_class where relname='lzl1';
 relname | relpages | reltuples 
---------+----------+-----------
 lzl1    |        0 |         0
--没有重建表 
lzldb=>  select pg_relation_filepath('lzl1');
 pg_relation_filepath 
----------------------
 base/16384/16446

pages全部成功回收,表没有重建。问题就算定位完了。 不过为了更进一步了解vacuum truncation机制,还可以继续看看下面的章节。

lazy_truncate_heap函数源码分析
#

仅仅通过should_attempt_truncation函数来判断是否truncation还不严谨,还要看一眼真正执行truncation的函数lazy_truncate_heap,里面还有一些判断

/*
 * lazy_truncate_heap - try to truncate off any empty pages at the end
 */
static void
lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
{
	BlockNumber old_rel_pages = vacrelstats->rel_pages;
	BlockNumber new_rel_pages;
	int			lock_retry;

	/* Report that we are now truncating */
	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
								 PROGRESS_VACUUM_PHASE_TRUNCATE);

	/*
	 * Loop until no more truncating can be done.
	 */
	do
	{
		PGRUsage	ru0;

		pg_rusage_init(&ru0);

		/*
		 * We need full exclusive lock on the relation in order to do
		 * truncation. If we can't get it, give up rather than waiting --- we
		 * don't want to block other backends, and we don't want to deadlock
		 * (which is quite possible considering we already hold a lower-grade
		 * lock).
		 */
		vacrelstats->lock_waiter_detected = false;
		lock_retry = 0;
		while (true)
		{	
			//如果可以获得锁,break while
			if (ConditionalLockRelation(onerel, AccessExclusiveLock))
				break;

			/*
			 * Check for interrupts while trying to (re-)acquire the exclusive
			 * lock.
			 */
			CHECK_FOR_INTERRUPTS();
			
			//如果没有立即获得锁,最初(++lock_retry)=1,不大于100;当大于100时,不做truncation直接返回
			if (++lock_retry > (VACUUM_TRUNCATE_LOCK_TIMEOUT /
								VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL))
			{
				/*
				 * We failed to establish the lock in the specified number of
				 * retries. This means we give up truncating.
				 */
				vacrelstats->lock_waiter_detected = true;
				ereport(elevel,
						(errmsg("\"%s\": stopping truncate due to conflicting lock request",
								RelationGetRelationName(onerel))));
				return;
			}
			
			//sleep 50ms,看起来有点笨。理论上最长可以等50*100=5s
			pg_usleep(VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL * 1000L);
		}

		//当我们拿到排他锁后,查看vacuum过程中是否有新元组,如有则不做truncation
		new_rel_pages = RelationGetNumberOfBlocks(onerel);
		if (new_rel_pages != old_rel_pages)
		{
			UnlockRelation(onerel, AccessExclusiveLock);
			return;
		}

		new_rel_pages = count_nondeletable_pages(onerel, vacrelstats);
		//如果vacuum过程中有新元组写入,不做truncation
		if (new_rel_pages >= old_rel_pages)
		{
			/* can't do anything after all */
			UnlockRelation(onerel, AccessExclusiveLock);
			return;
		}

		/*
		 * Okay to truncate.
		 */
		RelationTruncate(onerel, new_rel_pages);

		//truncation后立即释放锁
		UnlockRelation(onerel, AccessExclusiveLock);

...
	} while (new_rel_pages > vacrelstats->nonempty_pages &&
			 vacrelstats->lock_waiter_detected);
}

其中

#define VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL 50 /* 微秒!! */
#define VACUUM_TRUNCATE_LOCK_TIMEOUT 5000 /* 微秒!! */

真正调用的主题函数是RelationTruncate,前面一大截都是在尝试获取AccessExclusiveLock。除了之前说的2个条件外,还有以下两种情况不会做truncation

  • 没有获得AccessExclusiveLock
  • vacuum过程中有新数据写入

vacuum truncate可能会等待5秒
#

上面翻lazy_truncate_heap源码的时候,发现循环获取锁那里的等待时间是有点笨:

pg_usleep(VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL * 1000L);

这里每次循环会等待50ms,理论上最长可以等50*100=5s! 测试一把这个等待时间:

窗口1窗口2
create table lzl2;
alter table lzl2 set (autovacuum_enabled=off);;
insert into lzl2 select generate_series(1,1000) a;
delete from lzl2;
begin;
select * from lzl2;
\timing
vacuum lzl2; –Time: 5022.122 ms (00:05.022)
可以看到等待时间约5s。
手速快的话还可以再开个窗口抓下会话2的pstack
[postgres@cncq081298 lzl]$ pstack 4113
#0  0x00002b92a978c013 in __select_nocancel () from /lib64/libc.so.6
#1  0x000000000086225a in pg_usleep (microsec=microsec@entry=50000) at pgsleep.c:56
#2  0x00000000005e8212 in lazy_truncate_heap (vacrelstats=0xfc4490, onerel=0x2b92a8bc88d8) at vacuumlazy.c:1861
#3  lazy_vacuum_rel (onerel=onerel@entry=0x2b92a8bc88d8, options=options@entry=5, params=params@entry=0x7ffc96bb31d0, bstrategy=<optimized out>) at vacuumlazy.c:290
#4  0x00000000005e4551 in vacuum_rel (relid=32778, relation=<optimized out>, options=options@entry=5, params=params@entry=0x7ffc96bb31d0) at vacuum.c:1572
#5  0x00000000005e55ac in vacuum (options=5, relations=0xfc6540, params=params@entry=0x7ffc96bb31d0, bstrategy=<optimized out>, bstrategy@entry=0x0, isTopLevel=isTopLevel@entry=true) at vacuum.c:340
...

它走到了lazy_truncate_heap上的pg_usleep ,传入entry=50000 microsec,实际上pg_usleep循环了100次,总等待时间是50000*100 microsec=5s。

后来PG15这段代码更优化了,把pg_usleep改成WaitLatch

(void) WaitLatch(MyLatch,
WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL,
WAIT_EVENT_VACUUM_TRUNCATE);
ResetLatch(MyLatch);

vacuum truncate小结
#

vacuum触发truncation的条件(且):

  • 末尾空page数大于1000 或 末尾空page数大于总page数的1/16
  • old_snapshot_threshold<0
  • PG15(不含)以前需要在5s内获得8级锁AccessExclusiveLock
  • vacuum过程中没有新数据写入

相关文章

PG在还没有pg_class的时候怎么访问基础系统表?

·3748 字·8 分钟
在没有pg_class的时候,数据库怎么访问系统表?这个问题可以分成两个阶段来看: 数据库簇初始化,此时一个database都没有,所以怎么构造和访问pg_class等系统表是一个问题 私有内存初始化系统表。PG的系统表信息是放在backend本地进程上的,backend在初始化的时候又怎么load pg_class? 初始化数据字典 # 在数据库还没有初始化的时候,明显是不能通过访问数据字典来初始化database、pg_class等等对象的,因为一个库都没有就不能create database,也没有pg_class去查元数据信息。 PG通过bki文件的特殊语言初始化一些数据结构,然后在bootstrap模式初始化一个原始database1。

pg truncate浅析

·3936 字·8 分钟
命令选项 # TRUNCATE [ TABLE ] [ ONLY ] name [ * ] [, ... ] [ RESTART IDENTITY | CONTINUE IDENTITY ] [ CASCADE | RESTRICT ] 1.ONLY:只truncate指定的表。当表有继承子表或有子分区时,默认会一起truncate;only可只truncate继承父表。分区父表不能指定only

从很慢的唯一索引扫描到索引膨胀

·10077 字·21 分钟
走主键的SQL是怎么访问了多个数据页的? # 书接上回 :长事务、表膨胀、limit问题的一个经典案例,这篇文章有一个点没有说的很仔细: 为什么一个走主键的SQL会产生那么多shared hit? 为什么索引膨胀会导致访问多个数据页呢?页内的HOT只要一个数据页访问,页外的数据难道不可以通过访问对应的那一条索引条目来定位? 这跟索引的版本管理有关系了,其实索引还是有一点版本信息的,但不多。先温故一下pg的btree索引结构 (https://en.wikibooks.org/wiki/PostgreSQL/Index_Btree) 这个pg btree wiki图其实没有解释死元组和死索引条目的访问方式,它没有版本信息。目前不用硬理解这个结构的所有细节,知道有这么个btree结构就行。