Case study: instance startup failure and SysV shared memory

Author: liuzhilong62
PostgreSQL DBA, focused on database internals, case analysis, and source code walkthroughs

Symptoms

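The database instance's RSS memory was exhausted, the log showed OOM messages, and the instance went down. The cause of the OOM is not analyzed here.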

Starting the instance then failed repeatedly. The start log shows five start attempts, four of which failed:

2026-02-12 09:15:21 CST::@:[578272]: FATAL:  pre-existing shared memory block (key 2048, ID 1328250881) is still in use
2026-02-12 09:15:21 CST::@:[578272]: HINT:  Terminate any old server processes associated with data directory "/data".
2026-02-12 09:15:21 CST::@:[578272]: LOG:  database system is shut down
2026-02-12 09:21:03 CST::@:[658824]: FATAL:  pre-existing shared memory block (key 2048, ID 1328250881) is still in use
2026-02-12 09:21:03 CST::@:[658824]: HINT:  Terminate any old server processes associated with data directory "/data".
2026-02-12 09:21:03 CST::@:[658824]: LOG:  database system is shut down
2026-02-12 09:31:12 CST::@:[794791]: LOG:  redirecting log output to logging collector process
2026-02-12 09:31:12 CST::@:[794791]: HINT:  Future log output will appear in directory "/data/pg_log".
2026-02-12 09:31:37 CST::@:[801049]: FATAL:  lock file "postmaster.pid" already exists
2026-02-12 09:31:37 CST::@:[801049]: HINT:  Is another postmaster (PID 794791) running in data directory "/data"?
2026-02-12 09:32:34 CST::@:[814396]: FATAL:  lock file "postmaster.pid" already exists
2026-02-12 09:32:34 CST::@:[814396]: HINT:  Is another postmaster (PID 794791) running in data directory "/data"?

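The start at 09:31:12 succeeded because the DBA ran ipcrm -m xxx before starting the instance.

That fixed the problem quickly, but it left a number of questions open:

  • Why is this scenario not very common in practice?
  • start.log shows two kinds of startup errors. Which operations and which logic does each correspond to?
  • Can the shared memory still exist when the postmaster (PM) is gone?
  • How do you locate and clean up this shared memory segment?
  • PG has several shared memory segments. Which one is this?
  • Is there any way to start the instance other than ipcrm -m?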

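Error analysis: pre-existing shared memory block

Three kinds of shared memory

Normally a running PG instance has three shared memory segments.

Take the default configuration, shared_memory_type='mmap' without huge pages, as an example: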

## Inspect the shared memory PG actually maps, as seen in its virtual memory areas
cat /proc/`head -1 $PGDATA/postmaster.pid`/smaps | grep -E "\-s"
2b61b0563000-2b61b0564000 rw-s 00000000 00:04 116293664                  /SYSV00001000 (deleted)
2b61b057f000-2b61b05b3000 rw-s 00000000 00:12 1501001168                 /dev/shm/PostgreSQL.1193490778
2b61bbac2000-2b61fa67a000 rw-s 00000000 00:04 1500999610                 /dev/zero (deleted)

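From top to bottom, these are:

  • the SysV shared memory used for the startup interlock (/SYSV00001000)
  • the shared memory used for parallel query (/dev/shm/PostgreSQL.*)
  • the shared memory used by shared_buffers (/dev/zero)

If shared_buffers uses huge pages, or shared_memory_type is sysv instead of mmap, the output differs slightly.

With huge pages: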

2aaaaac00000-2aba9ca00000 rw-s 00000000 00:0e 48453452                   /anon_hugepage (deleted)
2b08f2eea000-2b08f2eeb000 rw-s 00000000 00:04 50692152                   /SYSV00001000 (deleted)
2b08f2f05000-2b08f302d000 rw-s 00000000 00:12 48436142                   /dev/shm/PostgreSQL.1345689218

With shared_memory_type = 'sysv':

2b03b3ceb000-2b03b3d1f000 rw-s 00000000 00:12 1572332304                 /dev/shm/PostgreSQL.2883611352
2b03bf0c2000-2b03fdc7a000 rw-s 00000000 00:04 143917075                  /SYSV00001000 (deleted)

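Summary:

| PG shared memory configuration | Shared segments in smaps | shared_buffers mapping | SysV mapping |
| --- | --- | --- | --- |
| shared_memory_type=mmap, no huge pages | 3 | /dev/zero | /SYSV00001000 |
| shared_memory_type=sysv, no huge pages | 2 | /SYSV00001000 | /SYSV00001000 |
| shared_memory_type=mmap, huge pages | 3 | /anon_hugepage | /SYSV00001000 |
| shared_memory_type=sysv, huge pages | not supported | not supported | - |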
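
The classification in the table can be automated. The following is a minimal sketch, not part of the original analysis: classify_shared_mappings and sysv_key are hypothetical helper names, and the kind labels are this sketch's own. It keeps the shared (rw-s) mappings from smaps output and labels them by path. It relies on the Linux convention that a SysV mapping is named /SYSV followed by the key in hexadecimal, which agrees with the outputs above (0x00001000 = 4096).

```python
import re

# Header line of one mapping: range, perms, offset, dev, inode, path.
MAP_LINE = re.compile(
    r"^([0-9a-f]+)-([0-9a-f]+)\s+(\S{4})\s+[0-9a-f]+\s+\S+\s+(\d+)\s+(\S.*)$"
)


def classify_shared_mappings(smaps_text):
    """Return (kind, size_bytes, inode, path) for every shared mapping."""
    found = []
    for line in smaps_text.splitlines():
        m = MAP_LINE.match(line.strip())
        if not m or not m.group(3).endswith("s"):
            continue  # not a mapping header, or a private mapping
        start, end = int(m.group(1), 16), int(m.group(2), 16)
        inode = int(m.group(4))
        path = m.group(5).replace(" (deleted)", "")
        if path.startswith("/SYSV"):
            kind = "sysv"
        elif path.startswith("/dev/shm/PostgreSQL."):
            kind = "dsm"
        elif path == "/anon_hugepage":
            kind = "hugepage"
        elif path == "/dev/zero":
            kind = "mmap"
        else:
            kind = "other"
        found.append((kind, end - start, inode, path))
    return found


def sysv_key(path):
    """Decode the SysV key from a mapping name such as /SYSV00001000."""
    return int(path[len("/SYSV"):], 16)
```

With shared_memory_type=sysv the interlock and shared_buffers live in the same /SYSV mapping, so the size field is what tells a 4 KB interlock segment from a full shared memory block.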

So the question is: which of these segments does the error "pre-existing shared memory block" refer to?

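Source code analysis

Searching the source for the error message quickly leads to the key location: src/backend/port/sysv_shmem.c

First, what is the SysV shmem for? The following is pieced together from comments scattered through the source: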

We still require a SysV shmem block to
 * exist, though, because mmap'd shmem provides no way to find out how
 * many processes are attached, which we need for interlocking purposes.
 
 * As of PostgreSQL 9.3, we normally allocate only a very small amount of
 * System V shared memory, and only for the purposes of providing an
 * interlock to protect the data directory.  The real shared memory block
 * is allocated using mmap().  This works around the problem that many
 * systems have very low limits on the amount of System V shared memory
 * that can be allocated.  Even a limit of a few megabytes will be enough
 * to run many copies of PostgreSQL without needing to adjust system settings.
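  • SysV shmem can tell how many processes are attached to a segment; mmap cannot
  • This SysV segment exists to protect the datadir; shared_buffers uses mmap (the default), not SysV
  • This SysV segment is tiny (the address range 2b61b0563000-2b61b0564000 is a single 4 KB page)

Next, the enum of shm states: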

typedef enum
{
	SHMSTATE_ANALYSIS_FAILURE,	/* unexpected failure to analyze the ID */
	SHMSTATE_ATTACHED,			/* pertinent to DataDir, has attached PIDs */
	SHMSTATE_ENOENT,			/* no segment of that ID */
	SHMSTATE_FOREIGN,			/* exists, but not pertinent to DataDir */
	SHMSTATE_UNATTACHED			/* pertinent to DataDir, no attached PIDs */
} IpcMemoryState;

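The states to focus on are ATTACHED, FOREIGN, and UNATTACHED.

The SysV shmem protects the datadir; the typical case is making sure two instances never run on the same directory. Since it is just a shared memory segment, it may for all sorts of odd reasons belong to a different directory or a different instance, in which case its state is FOREIGN. If the segment belongs to this datadir but no process is attached, the state is UNATTACHED; if processes are still attached, it is ATTACHED.

With that in mind, look at where PGSharedMemoryCreate raises the error: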

PGShmemHeader *
PGSharedMemoryCreate(Size size,
					 PGShmemHeader **shim)
{...
    for (;;)  // loop until a usable key is found
	{..
        shmid = shmget(NextShmemSegID, sizeof(PGShmemHeader), 0);// look up the existing segment for this key and get its shmid
		if (shmid < 0)
		{
			oldhdr = NULL;
			state = SHMSTATE_FOREIGN;
		}
		else
			state = PGSharedMemoryAttach(shmid, NULL, &oldhdr);// determine the state of that segment
            
        switch (state)// act according to the segment state
		{   
            ...// only two cases are shown here: attached and unattached
			case SHMSTATE_ATTACHED: // processes are still attached: raise the error seen in the incident
				ereport(FATAL,
						(errcode(ERRCODE_LOCK_FILE_EXISTS),
						 errmsg("pre-existing shared memory block (key %lu, ID %lu) is still in use",
								(unsigned long) NextShmemSegID,
								(unsigned long) shmid),
						 errhint("Terminate any old server processes associated with data directory \"%s\".",
								 DataDir)));
				break;
            ...
			case SHMSTATE_UNATTACHED:// no process is attached

				/*
				 * The segment pertains to DataDir, and every process that had
				 * used it has died or detached.  Zap it, if possible, and any
				 * associated dynamic shared memory segments, as well.  This
				 * shouldn't fail, but if it does, assume the segment belongs
				 * to someone else after all, and try the next candidate.
				 * Otherwise, try again to create the segment.  That may fail
				 * if some other process creates the same shmem key before we
				 * do, in which case we'll try the next key.
				 */
                // the segment belongs to this DataDir and no process holds it any more
				if (oldhdr->dsm_control != 0)
					dsm_cleanup_using_control_segment(oldhdr->dsm_control);
				if (shmctl(shmid, IPC_RMID, NULL) < 0)
					NextShmemSegID++;   // note: the key is incremented only if the removal fails
				break;
		}
     ...
     }
 ...
 }

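So when processes are still attached to the segment, startup fails with this error. When nothing is attached, PG removes the stale segment and retries with the same key; it moves on to the next key (NextShmemSegID + 1) only if the removal fails.

  • The first case is this incident
  • The second case is why an instance can normally be started again after a crash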
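
The decision logic can be condensed into a small model. This is a simplified sketch covering only the three states discussed here, not the PostgreSQL implementation: Segment, classify, and startup_action are hypothetical names, and the (device, inode) pair stands in for the data directory identity that the source comment says is used to recognize segments.

```python
from dataclasses import dataclass
from enum import Enum


class ShmState(Enum):
    ATTACHED = "attached"      # belongs to this DataDir, processes attached
    FOREIGN = "foreign"        # exists, but belongs to something else
    UNATTACHED = "unattached"  # belongs to this DataDir, nobody attached


@dataclass
class Segment:
    datadir_id: tuple  # (device, inode) of the data directory that created it
    nattch: int        # number of attached processes, as shown by ipcs


def classify(seg, my_datadir_id):
    """Map an existing segment to one of the three states."""
    if seg.datadir_id != my_datadir_id:
        return ShmState.FOREIGN
    return ShmState.ATTACHED if seg.nattch > 0 else ShmState.UNATTACHED


def startup_action(state):
    """What the startup loop does for each state."""
    if state is ShmState.ATTACHED:
        return "fail"              # FATAL: ... is still in use
    if state is ShmState.UNATTACHED:
        return "remove-and-retry"  # IPC_RMID, then retry the same key
    return "try-next-key"          # FOREIGN: NextShmemSegID++
```

For example, a segment of this data directory with nattch=1 classifies as ATTACHED and maps to "fail", which is the incident; the same segment with nattch=0 maps to "remove-and-retry", which is the ordinary restart after a crash.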

SysV shmem

The logic around postmaster.pid and sysv_shmem was reworked substantially in PG10 and has barely changed since. This article only covers the logic from PG10 onward.

pidfile.h:

#define LOCK_FILE_LINE_SHMEM_KEY	7

sysv_shmem.c, InternalIpcMemoryCreate():

	{
		char		line[64];

		sprintf(line, "%9lu %9lu",
				(unsigned long) memKey, (unsigned long) shmid);
		AddToDataDirLockFile(LOCK_FILE_LINE_SHMEM_KEY, line);
	}

As the source shows, the shmem information is stored on line 7 of postmaster.pid: first the shmkey, then the shmid.

> cat postmaster.pid 
242712
/data
1772698474
8531
/tmp
0.0.0.0
     4096 143917078   # <----here
ready
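
Reading that line programmatically is straightforward. A minimal sketch follows; read_shmem_ids is a hypothetical helper name, and the constant simply repeats the value from pidfile.h quoted above.

```python
LOCK_FILE_LINE_SHMEM_KEY = 7  # same value as in pidfile.h


def read_shmem_ids(pidfile_text):
    """Return (shmkey, shmid) from the text of postmaster.pid, or None."""
    lines = pidfile_text.splitlines()
    if len(lines) < LOCK_FILE_LINE_SHMEM_KEY:
        return None  # the file is too short to contain the shmem line
    fields = lines[LOCK_FILE_LINE_SHMEM_KEY - 1].split()
    if len(fields) < 2 or not (fields[0].isdigit() and fields[1].isdigit()):
        return None  # the line is empty or not in "key id" form
    return int(fields[0]), int(fields[1])
```

Applied to the file shown above, it yields shmkey 4096 and shmid 143917078.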

What are shmkey and shmid

This is how the PG source calls it, in InternalIpcMemoryCreate():

			shmid = shmget(memKey, size, IPC_CREAT | IPC_EXCL | IPCProtection);

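PG passes the shmkey (memKey) to the kernel as the key when requesting the shmem, and the kernel returns the unique identifier shmid.

The shmid depends heavily on the state of the server, or more precisely of the server's memory. For PG, if the instance is restarted quickly, the shmid before and after may be identical or differ by 1, which comes down to Linux kernel behavior; after a server reboot it is completely different.

One way to think about it: regardless of whether the server reboots, the shmkey (memKey) can be a fixed value, because it is supplied by the caller (PG). Across a server reboot, however, the same shmkey is unlikely to yield the same shmid.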

How PG obtains the shmkey

PGSharedMemoryCreate():

	/*
	 * We use the data directory's ID info (inode and device numbers) to
	 * positively identify shmem segments associated with this data dir, and
	 * also as seeds for searching for a free shmem key.
	 */
	if (stat(DataDir, &statbuf) < 0)
		ereport(FATAL,
				(errcode_for_file_access(),
				 errmsg("could not stat data directory \"%s\": %m",
						DataDir)));
...
	/*
	 * Loop till we find a free IPC key.  Trust CreateDataDirLockFile() to
	 * ensure no more than one postmaster per data directory can enter this
	 * loop simultaneously.  (CreateDataDirLockFile() does not entirely ensure
	 * that, but prefer fixing it over coping here.)
	 */
	NextShmemSegID = statbuf.st_ino;

	for (;;)
	{
		IpcMemoryId shmid;
		PGShmemHeader *oldhdr;
		IpcMemoryState state;

		/* Try to create new segment */
		memAddress = InternalIpcMemoryCreate(NextShmemSegID, sysvsize);
		if (memAddress)
			break;				/* successful create and attach */

		/* Check shared memory and possibly remove and recreate */

		/*
		 * shmget() failure is typically EACCES, hence SHMSTATE_FOREIGN.
		 * ENOENT, a narrow possibility, implies SHMSTATE_ENOENT, but one can
		 * safely treat SHMSTATE_ENOENT like SHMSTATE_FOREIGN.
		 */
		shmid = shmget(NextShmemSegID, sizeof(PGShmemHeader), 0);

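PG calls stat on the datadir; the result includes the datadir's inode, and PG uses that inode number directly as the starting shmkey.

In PG the shmem key is tightly bound to the datadir's inode; in the common case shmem key = datadir inode.

Verification example: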

> ls -id $PGDATA
4096 /lzlcloud/pg8574/data
> cat postmaster.pid |head -7|tail -1
     4096 143917090

Here datadir inode = shmkey = 4096.
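
The same check can be scripted. A minimal sketch: skipped_keys is a hypothetical helper name, and it assumes a running instance whose postmaster.pid has a populated line 7 and an inode number small enough to be used as the key unchanged.

```python
import os


def skipped_keys(datadir):
    """Return shmkey minus datadir inode: 0 means the inode itself was used."""
    inode = os.stat(datadir).st_ino
    with open(os.path.join(datadir, "postmaster.pid")) as f:
        line7 = f.read().splitlines()[6]  # "shmkey shmid"
    shmkey = int(line7.split()[0])
    return shmkey - inode
```

A result of 0 corresponds to the example above; a positive result means the instance had to move past keys that were already taken.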

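PG's shmkey in a cloud environment

The previous section said that shmkey = datadir inode in the common case. In a cloud environment this mostly does not hold.

Our cloud environment: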

>  ls -id /lzlcloud/pg8298/data
4096 /lzlcloud/pg8298/data
>  ls -id /lzlcloud/pg8388/data
4096 /lzlcloud/pg8388/data
>  ls -id /lzlcloud/pg8095/data
4096 /lzlcloud/pg8095/data
>  cat /lzlcloud/pg8298/data/postmaster.pid|head -7|tail -1
     4096 971833391
>  cat /lzlcloud/pg8388/data/postmaster.pid|head -7|tail -1
     4097  62128161
>  cat /lzlcloud/pg8095/data/postmaster.pid|head -7|tail -1
     4098 143163441

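The data directories all have inode 4096, while the shmkeys are 4096, 4097, and 4098.

Why?

The inode part comes from the disk's file system:

  • Every file system has its own inode numbering
  • The file system reserves some inodes, so the lowest numbers are not available. With the way our data disks are mounted, the first usable inode is 4096

In other words, datadir inode = 4096 is the default result of how disks are mounted in our cloud environment. Other environments may differ; this was not investigated further. Still, when PG data directories sit on the same kind of file system mounted the same way, their inode numbers may well be equal.

The shmkey part comes from the PG source, PGSharedMemoryCreate():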

	NextShmemSegID = statbuf.st_ino;	// set once, before the loop

	for (;;)
	{	
        ...
		shmid = shmget(NextShmemSegID, sizeof(PGShmemHeader), 0);
		...
		switch (state)
		{
			case SHMSTATE_FOREIGN:
				NextShmemSegID++;
				break;

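The starting shmkey is the datadir inode, but the segment found under that key may be FOREIGN, in which case PG increments the shmkey and tries again.

Take the instance whose postmaster.pid shows shmkey=4097: it started with shmkey=4096, found that the segment for that key was in use by another instance (the PG instance with shmkey=4096), so it incremented the shmkey and requested a different segment.

Likewise, the instance with shmkey=4098 had to increment twice before it found a free shmkey and got its shmid.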
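
The resulting key assignment can be reproduced with a few lines. This is a toy model, not PostgreSQL code: assign_keys is a hypothetical name, and the start order of the three instances is an assumption made for illustration.

```python
def assign_keys(start_order):
    """start_order: list of (name, datadir_inode). Returns {name: shmkey}."""
    taken = set()
    assigned = {}
    for name, inode in start_order:
        key = inode            # every instance starts at its datadir inode
        while key in taken:    # key held by another instance: FOREIGN, key++
            key += 1
        taken.add(key)
        assigned[name] = key
    return assigned


keys = assign_keys([("pg8298", 4096), ("pg8388", 4096), ("pg8095", 4096)])
print(keys)
```

With three data directories that all have inode 4096, the instances end up with keys 4096, 4097, and 4098 in start order, which is the pattern seen in the postmaster.pid files above.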

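How the shmid ties everything together

The SysV shmid shows up in the startup error log, on line 7 of postmaster.pid, and in the inode column of the smaps mapping. With it, the segment can be inspected with ipcs and removed with ipcrm.

Example: note shmid=143917078 in each of the following.

Startup error log: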

pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2026-03-05 16:02:19 CST::@:[262388]: FATAL:  pre-existing shared memory block (key 4096, ID 143917078) is still in use

Line 7 of postmaster.pid:

> cat postmaster.pid |head -7|tail -1
     4096 143917078

smaps mapping:

cat /proc/`head -1 $PGDATA/postmaster.pid`/smaps | grep -E "\-s"  
2ad2b5189000-2ad2b518a000 rw-s 00000000 00:04 143917078                  /SYSV00001000 (deleted)

Inspecting and removing the segment by its SysV shmid:

ipcs -m -i  143917078  # to remove it: ipcrm -m shmid

Shared memory Segment shmid=143917078
uid=6001        gid=6001        cuid=6001       cgid=6001
mode=0600       access_perms=0600
bytes=56        lpid=242712     cpid=242712     nattch=10
att_time=Thu Mar  5 16:14:51 2026  
det_time=Thu Mar  5 16:14:49 2026  
change_time=Thu Mar  5 16:14:34 2026  
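
The field that decides whether startup is blocked is nattch. A minimal sketch for pulling it out of this output follows; parse_ipcs_segment is a hypothetical helper name, and the parser assumes the key=value layout shown above.

```python
import re

# Fields that are plain decimal numbers in the ipcs output shown above.
NUMERIC_FIELDS = {"shmid", "uid", "gid", "cuid", "cgid",
                  "bytes", "lpid", "cpid", "nattch"}


def parse_ipcs_segment(text):
    """Collect the key=value pairs printed by `ipcs -m -i <shmid>`."""
    info = {}
    for key, value in re.findall(r"(\w+)=(\S+)", text):
        if key in NUMERIC_FIELDS and value.isdigit():
            info[key] = int(value)
        else:
            info[key] = value  # e.g. mode stays the string "0600"
    return info
```

A result with nattch greater than 0 means some process is still attached, which is the condition under which startup refuses to proceed.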

Tests

Reproducing the production problem

Keep one backend process around so that it never exits, then kill -9 the PM.

> cat postmaster.pid
      4096 143917076
      
> ipcs -m -i  143917076  #shmem id
Shared memory Segment shmid=143917076
uid=6001        gid=6001        cuid=6001       cgid=6001
mode=0600       access_perms=0600
bytes=56        lpid=241567     cpid=64757      nattch=23 

> kill -stop 107648 # any one backend

> kill -9 64757 # the postmaster, or some other process

> ipcs -m -i  143917076
Shared memory Segment shmid=143917076
uid=6001        gid=6001        cuid=6001       cgid=6001
mode=0600       access_perms=0600
bytes=56        lpid=252283     cpid=64757      nattch=1   #nattch != 0

> pg_ctl start -D $PGDATA
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2026-03-05 16:02:19 CST::@:[262388]: FATAL:  pre-existing shared memory block (key 4096, ID 143917076) is still in use
2026-03-05 16:02:19 CST::@:[262388]: HINT:  Terminate any old server processes associated with data directory "/data".
 stopped waiting
pg_ctl: could not start server

nattch=1, so the instance cannot start.

Instance crash followed by a normal start

This is simply killing the instance and starting it again.

> cat postmaster.pid     
     4096 143917077
     
> ipcs -m -i  143917077 #shmem id
Shared memory Segment shmid=143917077
uid=6001        gid=6001        cuid=6001       cgid=6001
mode=0600       access_perms=0600
bytes=56        lpid=154800     cpid=134329     nattch=18

> kill -9 134329 # the postmaster, or some other process

> cat postmaster.pid     
     4096 143917077

> ipcs -m -i  143917077 # shmid unchanged, the shmem still exists
Shared memory Segment shmid=143917077
uid=6001        gid=6001        cuid=6001       cgid=6001
mode=0600       access_perms=0600
bytes=56        lpid=169360     cpid=134329     nattch=0 #nattch=0

> pg_ctl start -D $PGDATA  # starts successfully
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2026-03-05 16:14:34 CST::@:[242712]: LOG:  redirecting log output to logging collector process
2026-03-05 16:14:34 CST::@:[242712]: HINT:  Future log output will appear in directory "/data/pg_log".
 done
server started

> ipcs -m -i  143917077 # the leftover shmem was removed during startup
ipcs: id 143917077 not found
> ipcs -m -i  143917078 # the new segment's shmid is the old one + 1
Shared memory Segment shmid=143917078
uid=6001        gid=6001        cuid=6001       cgid=6001
mode=0600       access_perms=0600
bytes=56        lpid=273571     cpid=242712     nattch=26

> cat postmaster.pid # shmkey unchanged, shmid + 1
     4096 143917078

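After a plain kill -9 the instance starts normally, and the leftover shmem is removed during startup. The shmkey stays the same because the inode is 4096 and, once the stale segment is removed, shmkey 4096 is free again. The shmid being one higher is Linux kernel behavior; at the very least it shows that a different segment is in use.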

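Holding a file but not the shmem

Startup depends on the datadir inode, and the inode determines the starting shmem key. What startup really checks is whether the shmem is still held by another process, not whether a file descriptor is. So this test uses the logger, a process that holds file descriptors under the datadir but no shared memory.

$ cat /proc/77300/smaps | grep -E "\-s"  # the logger process; confirms it maps no shared memory
$ kill -stop 77300  # stop the logger
$ kill -9 77076  # kill -9 the PM
$ cat postmaster.pid   # the file is still there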
77076
/lzlcloud/pg8531/data
1772700343
8531
/tmp
0.0.0.0
     4096 143917080
ready   
$ ipcs -m -i 143917080  # the shared memory is still there

Shared memory Segment shmid=143917080
uid=6001        gid=6001        cuid=6001       cgid=6001
mode=0600       access_perms=0600
bytes=56        lpid=77319      cpid=77076      nattch=0
att_time=Thu Mar  5 17:27:11 2026  
det_time=Thu Mar  5 17:27:15 2026  
change_time=Thu Mar  5 16:45:43 2026  

$ ps -ef|grep 77300  # the process is still there
postgres  77300      1  0 16:45 ?        00:00:00 postgresql: lzldb: logger 
postgres 135246  46622  0 17:27 pts/1    00:00:00 grep --color=auto 77300
$ pg_ctl start -D $PGDATA  # starts successfully
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2026-03-05 17:27:55 CST::@:[140497]: LOG:  redirecting log output to logging collector process
2026-03-05 17:27:55 CST::@:[140497]: HINT:  Future log output will appear in directory "/data/pg_log".
 done
server started

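The logger holds files under the data directory but is not attached to the shared memory, so it does not block startup.

Startup still fails after deleting postmaster.pid

The procedure is much the same as above: keep one backend process around, kill -9 the PM, delete postmaster.pid, and start the instance.

The full transcript is omitted; startup fails with the following error: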

waiting for server to start....2026-03-06 15:29:48 CST::@:[22475]: FATAL:  pre-existing shared memory block (key 4098, ID 171868173) is still in use
2026-03-06 15:29:48 CST::@:[22475]: HINT:  Terminate any old server processes associated with data directory "/data".
2026-03-06 15:29:48 CST::@:[22475]: LOG:  database system is shut down

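So when a leftover process still holds the shmem, PG finds the segment even after the postmaster.pid file that records the shmid has been deleted.

Stopping another instance to start the current one

PG checks in two places whether a segment belongs to the current instance:

  1. the shmid found by using the datadir inode as the shmkey, or the following keys (shmkey++)
  2. the shmid recorded in postmaster.pid

Even with postmaster.pid deleted, PG can still tell that the shmem is held by other processes. But the datadir inode and shmkey++ behavior can be exploited to get the instance started.

As analyzed above, every datadir in our cloud environment has inode 4096, and the shmkeys differ only because of the shmkey++ logic in the source. So we can start or stop another PG instance whose datadir inode is also 4096, so that the current instance increments the shmkey one time more or one time fewer during startup and ends up with a different shmid.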

$ kill -stop 165245 
$ kill -9 164411  # stop the current instance while keeping one of its backends around

$ pg_ctl stop -D  /pg8531/data  # stop another instance
waiting for server to shut down.... done
server stopped
$ pg_ctl start -D /pg8574/data # start the current instance; fails because postmaster.pid has not been removed
rase_ctl: another server might be running; trying to start server anyway
waiting for server to start....2026-03-05 18:22:35 CST::@:[196209]: FATAL:  pre-existing shared memory block (key 4097, ID 143917087) is still in use
2026-03-05 18:22:35 CST::@:[196209]: HINT:  Terminate any old server processes associated with data directory "/pg8574/data".
 stopped waiting
rase_ctl: could not start server
Examine the log output.

$ mv /lzlcloud/pg8574/data/postmaster.pid{,.bak}  # move the current instance's postmaster.pid out of the way
$ pg_ctl start -D /lzlcloud/pg8574/data  # start the current instance again: success
2026-03-05 18:23:09 CST::@:[207725]: LOG:  redirecting log output to logging collector process
2026-03-05 18:23:09 CST::@:[207725]: HINT:  Future log output will appear in directory "/lzlcloud/pg8574/data/pg_log".
 done
server started

$ ipcs -m -i 143917087 # the SysV segment for the old shmid is still held by the backend we kept around

Shared memory Segment shmid=143917087
uid=6001        gid=6001        cuid=6001       cgid=6001
mode=0600       access_perms=0600
bytes=56        lpid=196209     cpid=164411     nattch=1
att_time=Thu Mar  5 18:22:35 2026  
det_time=Thu Mar  5 18:22:35 2026  
change_time=Thu Mar  5 18:21:04 2026  

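The instance starts: it requested a different shared memory segment, and the previous one was not cleaned up. That is the hack of stopping another instance to start the current one in a cloud environment.

There is one small precondition: the other instance must not only have the same inode as the current one, its shmkey must also be lower than the current instance's shmkey.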
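
The two checks and the precondition can be played through in a toy model. This is a sketch under simplifying assumptions, not PostgreSQL code: start_instance and StartupError are hypothetical names, a segment is reduced to (owner datadir, nattch), and the paths and keys mirror the test above.

```python
class StartupError(Exception):
    """Stands in for the FATAL 'pre-existing shared memory block' error."""


def start_instance(segments, datadir, inode, pidfile_key=None):
    """segments: {shmkey: (owner_datadir, nattch)}. Returns the new key."""
    # Check 1: the segment recorded in postmaster.pid (skipped if file is gone).
    if pidfile_key is not None and pidfile_key in segments:
        owner, nattch = segments[pidfile_key]
        if owner == datadir and nattch > 0:
            raise StartupError(f"key {pidfile_key} is still in use")
    # Check 2: search for a usable key, starting at the datadir inode.
    key = inode
    while True:
        if key not in segments:
            segments[key] = (datadir, 1)   # created and attached
            return key
        owner, nattch = segments[key]
        if owner != datadir:
            key += 1                       # FOREIGN: try the next key
        elif nattch > 0:
            raise StartupError(f"key {key} is still in use")  # ATTACHED
        else:
            del segments[key]              # UNATTACHED: remove, retry same key
```

With segments {4096: other instance, 4097: current instance with nattch=1}, the search passes 4096 as foreign and fails at 4097. Once the other instance is stopped and postmaster.pid is moved away, key 4096 is free and the search never reaches 4097. If the other instance held a higher key than the stuck segment, stopping it would free nothing below the stuck key, which is why the precondition matters.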

Error analysis: lock file "postmaster.pid" already exists

This one is much simpler than the shared memory error.

Startup always checks the lock file and the PID recorded in it, in CreateLockFile():

		if (other_pid != my_pid && other_pid != my_p_pid &&
			other_pid != my_gp_pid)
		{
			if (kill(other_pid, 0) == 0 ||
				(errno != ESRCH && errno != EPERM))
			{
				/* lockfile belongs to a live process */
				ereport(FATAL,
						(errcode(ERRCODE_LOCK_FILE_EXISTS),
						 errmsg("lock file \"%s\" already exists",
								filename),
						 isDDLock ?
						 (encoded_pid < 0 ?
						  errhint("Is another postgres (PID %d) running in data directory \"%s\"?",
								  (int) other_pid, refName) :
						  errhint("Is another postmaster (PID %d) running in data directory \"%s\"?",
								  (int) other_pid, refName)) :
						 (encoded_pid < 0 ?
						  errhint("Is another postgres (PID %d) using socket file \"%s\"?",
								  (int) other_pid, refName) :
						  errhint("Is another postmaster (PID %d) using socket file \"%s\"?",
								  (int) other_pid, refName))));
			}
		}
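
The liveness probe in this condition is kill(pid, 0). Below is a minimal sketch of the same probe on a POSIX system; lockfile_owner_alive is a hypothetical name, and the sketch leaves out the comparison against the caller's own PID, parent, and grandparent that precedes the probe in the C code.

```python
import os


def lockfile_owner_alive(pid):
    """True means: treat the lock file as owned by a live process."""
    try:
        os.kill(pid, 0)         # signal 0: existence check, nothing is sent
        return True
    except ProcessLookupError:  # ESRCH: no such process, the file is stale
        return False
    except PermissionError:     # EPERM: the PID exists under another user
        return False
    except OSError:
        return True             # any other errno counts as alive, as in C
```

Both ESRCH and EPERM lead to "not alive", matching the condition errno != ESRCH && errno != EPERM in the excerpt.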

Testing is even simpler: start the instance again while it is already running:

$ pg_ctl start -D /pg8531/data
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2026-03-06 15:59:05 CST::@:[89145]: FATAL:  lock file "postmaster.pid" already exists
2026-03-06 15:59:05 CST::@:[89145]: HINT:  Is another postmaster (PID 255500) running in data directory "/pg8531/data"?
 stopped waiting
pg_ctl: could not start server
Examine the log output.

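So the last few errors in start.log during the incident occurred because the instance was already running and was started several more times.

Summary

When PG starts, it first creates a SysV shmem segment (not the mmap segment behind shared_buffers) to lock the datadir. The lock is obtained by passing the datadir's inode number as the shmkey to shmget, which returns the segment's unique identifier, the shmid. Because the requested key may already be in use by another process, PG keeps incrementing the shmkey in a loop until it obtains a segment nobody else holds. Line 7 of postmaster.pid stores the shmkey and the shmid. In a cloud environment, PG instances on the same host commonly show consecutive shmkeys: the data disks are mounted the same way, so the data directories get the same inode, and shmkey++ does the rest.

If a PG instance is killed unexpectedly, its shmem is not cleaned up. Normally no leftover process holds the shared memory, so the next start removes the segment and succeeds. In the abnormal case a leftover process still holds the shared memory, startup fails, and manual intervention is needed.

Recommended ways to handle it:

  1. ipcrm -m (most recommended)
  2. Find the leftover process with lsof and kill it
  3. Reboot the host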
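
For the second option, the leftover processes can also be found by scanning /proc instead of using lsof. The sketch below is illustrative: pids_attached is a hypothetical helper name, and it relies on the observation from the smaps output earlier that the inode column of a /SYSV mapping equals the shmid. Processes whose maps file cannot be read by the current user are skipped.

```python
import os


def pids_attached(shmid, proc_root="/proc"):
    """Return the PIDs that map the SysV segment with the given shmid."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "maps")) as f:
                for line in f:
                    # Columns: range, perms, offset, dev, inode, path.
                    fields = line.split()
                    if (len(fields) >= 6 and fields[4] == str(shmid)
                            and fields[5].startswith("/SYSV")):
                        pids.append(int(entry))
                        break
        except OSError:
            continue  # process exited meanwhile, or is not readable by us
    return sorted(pids)
```

Run as the PostgreSQL OS user with the shmid from the error message, the returned PIDs are the candidates to inspect before killing anything.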

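Ways that get the instance started but are not recommended:

  1. mv postmaster.pid + stop another PG instance (one whose shmkey is lower than the current instance's)
  2. mv postmaster.pid + remount the data disk so that the inode changes

Finally, answers to the questions from the beginning:

  • Why is this scenario not very common in practice?

It takes an abnormal crash plus a leftover process that was not cleaned up. Often a crash leaves no such process behind, and a normal start just works.

  • start.log shows two kinds of startup errors. Which operations and which logic does each correspond to?

The shared memory error comes from an abnormal crash plus a leftover process; the postmaster.pid error comes from starting the instance more than once.

  • Can the shared memory still exist when the PM is gone?

Yes. PG's child processes do not necessarily exit on their own or get cleaned up by the OS. Once every process is gone, the segment is left behind with nattch=0, and the next start removes it.

  • How do you locate and clean up this shared memory segment?

The shmid appears in the startup log (start.log), and ipcrm -m $shmid removes the segment.

  • PG has several shared memory segments. Which one is this?

The SysV shmem, which protects the datadir and always exists; see the section "Three kinds of shared memory". It is not the same thing as shared_buffers under mmap.

  • Can the shmem be found from an inode or a file?

Linux provides no user-space interface for finding a shmem segment from an inode or a file (this sentence is 100% AI-generated and was cross-checked against several models). PG uses the datadir's inode as the seed shmkey when requesting the shmem, so it does not really find the segment through the inode. PG has its own lookup mechanism for the segment, and the mapping is not strict: shmkey++ is a compromise that lets the instance start.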
