Case Study: Startup Failure and SysV Shared Memory

Symptoms
#

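The database instance's RSS memory was exhausted, the log showed OOM messages, and the instance went down. The cause of the OOM is not analyzed here.

Restarting the instance then failed. The log shows four or five failed start attempts: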

2026-02-12 09:15:21 CST::@:[578272]: FATAL:  pre-existing shared memory block (key 2048, ID 1328250881) is still in use
2026-02-12 09:15:21 CST::@:[578272]: HINT:  Terminate any old server processes associated with data directory "/data".
2026-02-12 09:15:21 CST::@:[578272]: LOG:  database system is shut down
2026-02-12 09:21:03 CST::@:[658824]: FATAL:  pre-existing shared memory block (key 2048, ID 1328250881) is still in use
2026-02-12 09:21:03 CST::@:[658824]: HINT:  Terminate any old server processes associated with data directory "/data".
2026-02-12 09:21:03 CST::@:[658824]: LOG:  database system is shut down
2026-02-12 09:31:12 CST::@:[794791]: LOG:  redirecting log output to logging collector process
2026-02-12 09:31:12 CST::@:[794791]: HINT:  Future log output will appear in directory "/data/pg_log".
2026-02-12 09:31:37 CST::@:[801049]: FATAL:  lock file "postmaster.pid" already exists
2026-02-12 09:31:37 CST::@:[801049]: HINT:  Is another postmaster (PID 794791) running in data directory "/data"?
2026-02-12 09:32:34 CST::@:[814396]: FATAL:  lock file "postmaster.pid" already exists
2026-02-12 09:32:34 CST::@:[814396]: HINT:  Is another postmaster (PID 794791) running in data directory "/data"?

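The start eventually succeeded because the DBA ran ipcrm -m xxx before starting the instance.

The problem was solved quickly, but it left several open questions:

Why is this scenario not very common in practice?
start.log shows two kinds of startup errors. Which operations and logic does each correspond to?
Can the shared memory still exist when the postmaster (PM) is gone?
How can this shared memory segment be located and cleaned up?
PG has several shared memory segments. Which one is this?
Are there ways to start the instance other than ipcrm -m?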

Error analysis: `pre-existing shared memory block`
#

Three kinds of shared memory
#

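Normally, a running PG instance has three shared memory segments.

Take the default configuration, shared_memory_type='mmap' without huge pages, as an example:

## List the shared mappings that PG actually uses, taken from the postmaster's virtual memory map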
cat /proc/`head -1 $PGDATA/postmaster.pid`/smaps | grep -E "\-s"

2b61b0563000-2b61b0564000 rw-s 00000000 00:04 116293664                  /SYSV00001000 (deleted)
2b61b057f000-2b61b05b3000 rw-s 00000000 00:12 1501001168                 /dev/shm/PostgreSQL.1193490778
2b61bbac2000-2b61fa67a000 rw-s 00000000 00:04 1500999610                 /dev/zero (deleted)

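From top to bottom these are: the SysV shared memory used at startup, the dynamic shared memory used by parallel query, and the shared memory used by shared_buffers.

If shared_buffers uses huge pages, or shared_memory_type is sysv rather than mmap, the output differs slightly.

Huge pages: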

2aaaaac00000-2aba9ca00000 rw-s 00000000 00:0e 48453452                   /anon_hugepage (deleted)
2b08f2eea000-2b08f2eeb000 rw-s 00000000 00:04 50692152                   /SYSV00001000 (deleted)
2b08f2f05000-2b08f302d000 rw-s 00000000 00:12 48436142                   /dev/shm/PostgreSQL.1345689218

shared_memory_type = 'sysv':

2b03b3ceb000-2b03b3d1f000 rw-s 00000000 00:12 1572332304                 /dev/shm/PostgreSQL.2883611352
2b03bf0c2000-2b03fdc7a000 rw-s 00000000 00:04 143917075                  /SYSV00001000 (deleted)

Summary:

PG shared memory configuration	Shared segments in smaps	shared_buffers mapping in smaps	SysV interlock mapping in smaps
shared_memory_type=mmap, no huge pages	3	/dev/zero	/SYSV00001000
shared_memory_type=sysv, no huge pages	2	/SYSV00001000	/SYSV00001000
shared_memory_type=mmap, huge pages	3	/anon_hugepage	/SYSV00001000
shared_memory_type=sysv, huge pages	Not supported	Not supported	Not supported
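The classification in this table can be sketched as a small Python helper that reads smaps-style text and labels each shared mapping. The function names and the label strings are made up for illustration; only the path patterns come from the outputs shown above.

```python
def classify_shared_mapping(path):
    """Label the path column of a shared (rw-s) mapping line."""
    if path.startswith("/SYSV"):
        return "sysv"            # SysV segment (interlock, or main segment in sysv mode)
    if path.startswith("/dev/shm/PostgreSQL."):
        return "dsm"             # dynamic shared memory
    if path.startswith("/dev/zero"):
        return "mmap"            # anonymous shared mmap without huge pages
    if path.startswith("/anon_hugepage"):
        return "mmap-hugepage"   # anonymous shared mmap with huge pages
    return "other"


def shared_segments(smaps_text):
    """Return (label, path) for every shared mapping header line."""
    result = []
    for line in smaps_text.splitlines():
        fields = line.split()
        # header layout: address perms offset dev inode path [(deleted)]
        if len(fields) >= 6 and len(fields[1]) == 4 and fields[1][3] == "s":
            result.append((classify_shared_mapping(fields[5]), fields[5]))
    return result
```

Fed the default-configuration output above, this yields the labels sysv, dsm, mmap in that order. In sysv mode the two roles cannot be told apart by name, which matches the table.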

So the question is: which of these segments does the pre-existing shared memory block error refer to?

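Source code analysis
#

Searching the source for the error message quickly leads to the key file: src/backend/port/sysv_shmem.c

First, what is the SysV shmem for? The following is excerpted from scattered comments in that file: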

We still require a SysV shmem block to
 * exist, though, because mmap'd shmem provides no way to find out how
 * many processes are attached, which we need for interlocking purposes.
 
 * As of PostgreSQL 9.3, we normally allocate only a very small amount of
 * System V shared memory, and only for the purposes of providing an
 * interlock to protect the data directory.  The real shared memory block
 * is allocated using mmap().  This works around the problem that many
 * systems have very low limits on the amount of System V shared memory
 * that can be allocated.  Even a limit of a few megabytes will be enough
 * to run many copies of PostgreSQL without needing to adjust system settings.

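SysV shmem can report whether any process is still attached to a segment; mmap cannot.
This SysV segment exists to protect the data directory; shared_buffers uses mmap (by default), not SysV.
This SysV segment is very small (the address range 2b61b0563000-2b61b0564000 shows that a single 4 KB page is mapped).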
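The numbers in a smaps line can be decoded with a few lines of Python. This is a sketch under one assumption that is not stated in the article: that Linux names the mapping SYSV followed by the key as eight hexadecimal digits. The smaps output here is consistent with it (00001000 hex is 4096, the shmkey seen later in postmaster.pid). The helper names are hypothetical.

```python
def sysv_key_from_name(path):
    """Decode '/SYSV00001000' into the integer key (assumed hex encoding)."""
    name = path.rsplit("/", 1)[-1]
    if not name.startswith("SYSV"):
        raise ValueError("not a SysV shared memory mapping: %r" % path)
    return int(name[4:], 16)


def mapping_size(addr_range):
    """Size in bytes of an address range such as '2b61b0563000-2b61b0564000'."""
    start, end = (int(x, 16) for x in addr_range.split("-"))
    return end - start
```

For the line above, the mapping size comes out as 4096 bytes, that is, one page.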

Next, the enum of shared memory states:

typedef enum
{
	SHMSTATE_ANALYSIS_FAILURE,	/* unexpected failure to analyze the ID */
	SHMSTATE_ATTACHED,			/* pertinent to DataDir, has attached PIDs */
	SHMSTATE_ENOENT,			/* no segment of that ID */
	SHMSTATE_FOREIGN,			/* exists, but not pertinent to DataDir */
	SHMSTATE_UNATTACHED			/* pertinent to DataDir, no attached PIDs */
} IpcMemoryState;

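The states that matter here are ATTACHED, FOREIGN and UNATTACHED.

The SysV shmem protects the data directory; the typical case is making sure that two instances never run on the same directory. Because the segment is looked up by key, it may, for various odd reasons, turn out not to belong to this directory or this instance, in which case its state is FOREIGN. If the segment does belong to the data directory but no process is attached, it is UNATTACHED; if processes are attached, it is ATTACHED.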
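The three states can be modeled in a few lines of Python. This is a simplified sketch, not the logic of PGSharedMemoryAttach: the segment is represented as a dict with hypothetical fields `owner_id` (which data directory it belongs to) and `nattch` (attached process count).

```python
from enum import Enum


class ShmState(Enum):
    ENOENT = "no segment of that ID"
    FOREIGN = "exists, but not pertinent to DataDir"
    UNATTACHED = "pertinent to DataDir, no attached PIDs"
    ATTACHED = "pertinent to DataDir, has attached PIDs"


def classify(segment, datadir_id):
    """segment is None or {'owner_id': ..., 'nattch': int} (hypothetical model)."""
    if segment is None:
        return ShmState.ENOENT
    if segment["owner_id"] != datadir_id:
        return ShmState.FOREIGN
    if segment["nattch"] == 0:
        return ShmState.UNATTACHED
    return ShmState.ATTACHED
```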

Now look at where PGSharedMemoryCreate raises the error:

PGShmemHeader *
PGSharedMemoryCreate(Size size,
					 PGShmemHeader **shim)
{...
    for (;;)  // loop until a usable key is found
	{..
        shmid = shmget(NextShmemSegID, sizeof(PGShmemHeader), 0);// look up the existing segment under this key; returns its shmid
		if (shmid < 0)
		{
			oldhdr = NULL;
			state = SHMSTATE_FOREIGN;
		}
		else
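			state = PGSharedMemoryAttach(shmid, NULL, &oldhdr);// determine the state of this segment
            
        switch (state)// act according to the segment's state
		{   
            ...// only two cases are shown here: segment attached and segment unattached
			case SHMSTATE_ATTACHED: // processes are still attached: raise the error seen in the incident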
				ereport(FATAL,
						(errcode(ERRCODE_LOCK_FILE_EXISTS),
						 errmsg("pre-existing shared memory block (key %lu, ID %lu) is still in use",
								(unsigned long) NextShmemSegID,
								(unsigned long) shmid),
						 errhint("Terminate any old server processes associated with data directory \"%s\".",
								 DataDir)));
				break;
            ...
			case SHMSTATE_UNATTACHED:// no process is attached to the segment

				/*
				 * The segment pertains to DataDir, and every process that had
				 * used it has died or detached.  Zap it, if possible, and any
				 * associated dynamic shared memory segments, as well.  This
				 * shouldn't fail, but if it does, assume the segment belongs
				 * to someone else after all, and try the next candidate.
				 * Otherwise, try again to create the segment.  That may fail
				 * if some other process creates the same shmem key before we
				 * do, in which case we'll try the next key.
				 */
                // the segment belongs to DataDir and no process is attached to it any more
				if (oldhdr->dsm_control != 0)
					dsm_cleanup_using_control_segment(oldhdr->dsm_control);
				if (shmctl(shmid, IPC_RMID, NULL) < 0)
					NextShmemSegID++;   // note: the key only advances when the removal fails
				break;
		}
     ...
     }
 ...
 }

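So an ATTACHED segment raises the error. An UNATTACHED segment is removed and the loop retries the same key; only if the removal fails does the key advance (NextShmemSegID++) to try the next candidate.

The first case corresponds to this incident.
The second case is why an instance can normally be started again after a crash.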
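The two outcomes can be replayed with a toy model of the loop. This is a sketch, not PostgreSQL's implementation: segments live in a plain dict keyed by shmkey, the `owner_id` and `nattch` fields are hypothetical, and removing an unattached segment is assumed always to succeed.

```python
class StartupError(Exception):
    pass


def create_interlock(segments, datadir_id, start_key):
    """Return the key under which a new segment was created; mutates segments."""
    key = start_key
    while True:
        seg = segments.get(key)
        if seg is None:
            # free key: create the segment, with the new postmaster attached
            segments[key] = {"owner_id": datadir_id, "nattch": 1}
            return key
        if seg["owner_id"] != datadir_id:
            key += 1                  # FOREIGN: try the next key
        elif seg["nattch"] > 0:       # ATTACHED: refuse to start
            raise StartupError(
                "pre-existing shared memory block (key %d) is still in use" % key)
        else:
            del segments[key]         # UNATTACHED: remove it, retry the same key
```

With a leftover segment at nattch=0 the call returns the same key, which is the crash-then-restart case. With nattch=1 it raises, which is the incident.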

sysv shmem
#

The postmaster.pid and sysv_shmem logic was heavily reworked in PG 10 and has changed little since. This article only covers the logic from PG 10 onward.

pidfile.h:

#define LOCK_FILE_LINE_SHMEM_KEY	7

sysv_shmem.c，InternalIpcMemoryCreate()：

	{
		char		line[64];

		sprintf(line, "%9lu %9lu",
				(unsigned long) memKey, (unsigned long) shmid);
		AddToDataDirLockFile(LOCK_FILE_LINE_SHMEM_KEY, line);
	}

The source shows that the shmem information is stored on line 7 of postmaster.pid, as shmkey followed by shmid.

> cat postmaster.pid 
242712
/data
1772698474
8531
/tmp
0.0.0.0
     4096 143917078   # <----here
ready
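Reading the two values back is straightforward. A minimal Python sketch, assuming the eight-line layout shown above; the function name is made up for illustration.

```python
def read_shmem_line(pidfile_text):
    """Return (shmkey, shmid) from line 7 of postmaster.pid contents."""
    lines = pidfile_text.splitlines()
    if len(lines) < 7:
        raise ValueError("postmaster.pid has fewer than 7 lines")
    fields = lines[6].split()
    return int(fields[0]), int(fields[1])
```

A postmaster that has not yet reached this point of startup writes fewer lines, hence the length check.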

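What are shmkey and shmid
#

This is how the PG source calls it, in InternalIpcMemoryCreate():

			shmid = shmget(memKey, size, IPC_CREAT | IPC_EXCL | IPCProtection);

PG passes shmkey/memKey as the key when asking the kernel for a segment, and the kernel returns shmid, the unique identifier of that segment.

shmid depends heavily on the state of the server, more precisely on the state of its kernel. After a quick instance restart the new shmid may be the same as before or larger by one, which is a matter of Linux kernel behavior; after a server reboot it is completely different.

One way to think about it: shmkey/memKey can stay fixed across server reboots because it is caller input (here, from PG), whereas the shmid returned for the same shmkey before and after a reboot is unlikely to be the same.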

How PG obtains shmkey
#

PGSharedMemoryCreate()：

	/*
	 * We use the data directory's ID info (inode and device numbers) to
	 * positively identify shmem segments associated with this data dir, and
	 * also as seeds for searching for a free shmem key.
	 */
	if (stat(DataDir, &statbuf) < 0)
		ereport(FATAL,
				(errcode_for_file_access(),
				 errmsg("could not stat data directory \"%s\": %m",
						DataDir)));
...
	/*
	 * Loop till we find a free IPC key.  Trust CreateDataDirLockFile() to
	 * ensure no more than one postmaster per data directory can enter this
	 * loop simultaneously.  (CreateDataDirLockFile() does not entirely ensure
	 * that, but prefer fixing it over coping here.)
	 */
	NextShmemSegID = statbuf.st_ino;

	for (;;)
	{
		IpcMemoryId shmid;
		PGShmemHeader *oldhdr;
		IpcMemoryState state;

		/* Try to create new segment */
		memAddress = InternalIpcMemoryCreate(NextShmemSegID, sysvsize);
		if (memAddress)
			break;				/* successful create and attach */

		/* Check shared memory and possibly remove and recreate */

		/*
		 * shmget() failure is typically EACCES, hence SHMSTATE_FOREIGN.
		 * ENOENT, a narrow possibility, implies SHMSTATE_ENOENT, but one can
		 * safely treat SHMSTATE_ENOENT like SHMSTATE_FOREIGN.
		 */
		shmid = shmget(NextShmemSegID, sizeof(PGShmemHeader), 0);

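PG calls stat() on the data directory. The result includes the directory's inode number, which PG uses directly as the initial shmkey.

In PG the shmem key is therefore tightly bound to the data directory's inode; in the common case shmem key = datadir inode.

Verification: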

> ls -id $PGDATA
4096 /lzlcloud/pg8574/data
> cat postmaster.pid |head -7|tail -1
     4096 143917090

Here datadir.inode = shmkey = 4096.
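The same check can be scripted. The sketch below is a hypothetical helper that returns shmkey minus inode for a data directory; 0 means the instance got its first-choice key. It assumes the inode number fits in the key unchanged, which may not hold for very large inode numbers.

```python
import os


def shmkey_offset(datadir):
    """Return (shmkey on line 7 of postmaster.pid) - (inode of datadir)."""
    inode = os.stat(datadir).st_ino
    with open(os.path.join(datadir, "postmaster.pid")) as f:
        lines = f.read().splitlines()
    shmkey = int(lines[6].split()[0])
    return shmkey - inode
```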

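shmkey of PG in a cloud environment
#

The previous section said that shmkey = datadir.inode in the common case. In cloud environments this often does not hold.

Our cloud environment: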

>  ls -id /lzlcloud/pg8298/data
4096 /lzlcloud/pg8298/data
>  ls -id /lzlcloud/pg8388/data
4096 /lzlcloud/pg8388/data
>  ls -id /lzlcloud/pg8095/data
4096 /lzlcloud/pg8095/data

>  cat /lzlcloud/pg8298/data/postmaster.pid|head -7|tail -1
     4096 971833391
>  cat /lzlcloud/pg8388/data/postmaster.pid|head -7|tail -1
     4097  62128161
>  cat /lzlcloud/pg8095/data/postmaster.pid|head -7|tail -1
     4098 143163441

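All three data directories have inode 4096, yet the shmkeys are 4096, 4097 and 4098.

Why?

The inode side of this comes from the file system on the disk:

Each file system has its own inode number space.
The file system reserves some inodes, so the lowest numbers are unavailable. Depending on how the disk is mounted, the first usable inode on our data disks is 4096.

In other words, datadir.inode=4096 is the default outcome of how disks are mounted in our cloud environment. Other environments may differ; this was not analyzed further. Still, PG data directories mounted the same way on the same file system type may well end up with equal inode numbers.

The shmkey side comes from the PG source, in PGSharedMemoryCreate():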

	NextShmemSegID = statbuf.st_ino;
	...
	for (;;)
	{
		...
		shmid = shmget(NextShmemSegID, sizeof(PGShmemHeader), 0);
		...
		switch (state)
		{
			case SHMSTATE_FOREIGN:
				NextShmemSegID++;
				break;

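The search starts at shmkey=datadir.inode, but the segment found under that key may be FOREIGN, in which case PG increments shmkey and tries again.

Take the instance whose postmaster.pid shows shmkey=4097: it started with shmkey=4096, found that the segment under that key was in use by another instance (the PG instance with shmkey=4096), incremented shmkey and obtained a different segment with its own shmid.

Likewise, the instance with shmkey=4098 had to increment twice before finding a free shmkey.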
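The resulting key sequence can be reproduced with a tiny simulation. This is a sketch of the FOREIGN branch only: it assumes every earlier instance is still running, so each key already taken counts as foreign.

```python
def assign_keys(inodes):
    """inodes: datadir inode numbers in start order; returns the shmkey of each."""
    taken = set()
    keys = []
    for inode in inodes:
        key = inode
        while key in taken:   # segment under this key belongs to another instance
            key += 1          # shmkey++
        taken.add(key)
        keys.append(key)
    return keys
```

Three instances whose data directories all have inode 4096 end up with keys 4096, 4097 and 4098, as observed above.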

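Where shmid shows up
#

The SysV shmid can be found in the startup error log, on line 7 of postmaster.pid, and in the inode column of the SysV mapping in smaps. It can be inspected with ipcs and removed with ipcrm.

Example (note shmid=143917078 throughout):

Startup error log: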

pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2026-03-05 16:02:19 CST::@:[262388]: FATAL:  pre-existing shared memory block (key 4096, ID 143917078) is still in use

Line 7 of postmaster.pid:

> cat postmaster.pid |head -7|tail -1
     4096 143917078

Virtual memory map (smaps):

cat /proc/`head -1 $PGDATA/postmaster.pid`/smaps | grep -E "\-s"  
2ad2b5189000-2ad2b518a000 rw-s 00000000 00:04 143917078                  /SYSV00001000 (deleted)

Inspect and remove the segment by shmid:

ipcs -m -i  143917078  # to remove: ipcrm -m shmid

Shared memory Segment shmid=143917078
uid=6001        gid=6001        cuid=6001       cgid=6001
mode=0600       access_perms=0600
bytes=56        lpid=242712     cpid=242712     nattch=10
att_time=Thu Mar  5 16:14:51 2026  
det_time=Thu Mar  5 16:14:49 2026  
change_time=Thu Mar  5 16:14:34 2026  
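To cross-check these sources mechanically, the shmid can be pulled out of the FATAL line and out of the smaps line and compared. A small Python sketch; the regular expression follows the error message quoted in this article, and the function names are illustrative.

```python
import re

FATAL_RE = re.compile(
    r"pre-existing shared memory block \(key (\d+), ID (\d+)\) is still in use")


def shmem_from_log(line):
    """Return (shmkey, shmid) from a startup FATAL line, or None if absent."""
    m = FATAL_RE.search(line)
    return (int(m.group(1)), int(m.group(2))) if m else None


def shmid_from_maps_line(line):
    """Return the shmid (inode column) of a SysV mapping line, or None."""
    fields = line.split()
    # layout: address perms offset dev inode path [(deleted)]
    if len(fields) >= 6 and fields[5].startswith("/SYSV"):
        return int(fields[4])
    return None
```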

Tests
#

Reproducing the production problem
#

Keep one backend process alive so that it never exits, then kill -9 the PM.

> cat postmaster.pid
      4096 143917076
      
> ipcs -m -i  143917076  #shmem id
Shared memory Segment shmid=143917076
uid=6001        gid=6001        cuid=6001       cgid=6001
mode=0600       access_perms=0600
bytes=56        lpid=241567     cpid=64757      nattch=23 

> kill -stop 107648 # any backend

> kill -9 64757 # the postmaster (other processes may be killed too)

> ipcs -m -i  143917076
Shared memory Segment shmid=143917076
uid=6001        gid=6001        cuid=6001       cgid=6001
mode=0600       access_perms=0600
bytes=56        lpid=252283     cpid=64757      nattch=1   #nattch != 0

> pg_ctl start -D $PGDATA
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2026-03-05 16:02:19 CST::@:[262388]: FATAL:  pre-existing shared memory block (key 4096, ID 143917076) is still in use
2026-03-05 16:02:19 CST::@:[262388]: HINT:  Terminate any old server processes associated with data directory "/data".
 stopped waiting
pg_ctl: could not start server

nattch=1, so the instance cannot start.
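The decisive field is nattch. A minimal Python sketch that reads it from captured `ipcs -m -i` output; it parses text only and does not run ipcs, and the helper names are made up.

```python
import re


def parse_ipcs(text):
    """Collect name=value pairs from `ipcs -m -i <shmid>` output as strings."""
    return dict(re.findall(r"(\w+)=(\S+)", text))


def blocks_startup(ipcs_text):
    """True when at least one process is still attached to the segment."""
    return int(parse_ipcs(ipcs_text).get("nattch", "0")) > 0
```

Applied to the two outputs above, the first (nattch=23, instance running) and the second (nattch=1, one stopped backend left) both report that the segment is in use.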

Instance crash followed by a normal start
#

This is simply killing the instance and starting it again.

> cat postmaster.pid     
     4096 143917077
     
> ipcs -m -i  143917077 #shmem id
Shared memory Segment shmid=143917077
uid=6001        gid=6001        cuid=6001       cgid=6001
mode=0600       access_perms=0600
bytes=56        lpid=154800     cpid=134329     nattch=18

> kill -9 134329 # the postmaster (other processes may be killed too)

> cat postmaster.pid     
     4096 143917077

> ipcs -m -i  143917077 # shmid unchanged, the segment still exists
Shared memory Segment shmid=143917077
uid=6001        gid=6001        cuid=6001       cgid=6001
mode=0600       access_perms=0600
bytes=56        lpid=169360     cpid=134329     nattch=0 #nattch=0

> pg_ctl start -D $PGDATA  # start succeeds
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2026-03-05 16:14:34 CST::@:[242712]: LOG:  redirecting log output to logging collector process
2026-03-05 16:14:34 CST::@:[242712]: HINT:  Future log output will appear in directory "/data/pg_log".
 done
server started

> ipcs -m -i  143917077 # the leftover segment was removed during startup
ipcs: id 143917077 not found
> ipcs -m -i  143917078 # the new shmid is the old one +1
Shared memory Segment shmid=143917078
uid=6001        gid=6001        cuid=6001       cgid=6001
mode=0600       access_perms=0600
bytes=56        lpid=273571     cpid=242712     nattch=26

> cat postmaster.pid # shmkey unchanged, shmid +1
     4096 143917078

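After a plain kill -9 the instance starts normally, and the leftover segment is removed during startup. shmkey is unchanged because inode=4096 and nothing else occupies shmkey=4096. The shmid +1 is Linux kernel behavior and at least shows that a different segment is in use.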

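Holding files but not shmem
#

Startup derives shmkey from the datadir inode and uses it to find the shmem segment, so what it really checks is whether other processes are still attached to that segment, not whether other processes still hold file descriptors. This test therefore uses the logger, a process that holds file descriptors under the data directory but is not attached to shared memory.

$ cat /proc/77300/smaps | grep -E "\-s"  # the logger process; no output, so it maps no shared memory
$ kill -stop 77300  # stop the logger
$ kill -9 77076  # kill -9 the PM
$ cat postmaster.pid   # the file is still there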
77076
/lzlcloud/pg8531/data
1772700343
8531
/tmp
0.0.0.0
     4096 143917080
ready   
$ ipcs -m -i 143917080  # the segment is still there

Shared memory Segment shmid=143917080
uid=6001        gid=6001        cuid=6001       cgid=6001
mode=0600       access_perms=0600
bytes=56        lpid=77319      cpid=77076      nattch=0
att_time=Thu Mar  5 17:27:11 2026  
det_time=Thu Mar  5 17:27:15 2026  
change_time=Thu Mar  5 16:45:43 2026  

$ ps -ef|grep 77300  # the process is still there
postgres  77300      1  0 16:45 ?        00:00:00 postgresql: lzldb: logger 
postgres 135246  46622  0 17:27 pts/1    00:00:00 grep --color=auto 77300
$ pg_ctl start -D $PGDATA  # start succeeds
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2026-03-05 17:27:55 CST::@:[140497]: LOG:  redirecting log output to logging collector process
2026-03-05 17:27:55 CST::@:[140497]: HINT:  Future log output will appear in directory "/data/pg_log".
 done
server started

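The logger holds files under the data directory but is not attached to the shared memory, so it does not block startup.

Startup fails even after deleting postmaster.pid
#

The procedure is much the same as above: keep one backend alive, kill -9 the PM, delete postmaster.pid, then start.

The transcript is omitted. The start fails with: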

waiting for server to start....2026-03-06 15:29:48 CST::@:[22475]: FATAL:  pre-existing shared memory block (key 4098, ID 171868173) is still in use
2026-03-06 15:29:48 CST::@:[22475]: HINT:  Terminate any old server processes associated with data directory "/data".
2026-03-06 15:29:48 CST::@:[22475]: LOG:  database system is shut down

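So when a leftover process still holds the shmem, PG finds the corresponding shmid even if the postmaster.pid file containing it has been deleted.

Stopping another instance to start the current one
#

PG checks two places to decide whether a shmid belongs to the current instance:

The shmid found under shmkey=datadir.inode, or under later keys after shmkey++
The shmid recorded in postmaster.pid

Even with postmaster.pid deleted, PG can still tell whether the shmem is held by other processes. But the datadir.inode and shmkey++ behavior can be used to get the instance started.

As analyzed earlier, the datadir inodes in our cloud environment are all 4096, and the shmkeys differ only because of the shmkey++ logic. So we can start or stop another PG instance with datadir.inode=4096, so that the current instance goes through one more or one fewer shmkey++ at startup and ends up with a different shmid.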

$ kill -stop 165245 
$ kill -9 164411  # kill the current instance while one of its backends stays alive

$ pg_ctl stop -D  /pg8531/data  # stop another instance
waiting for server to shut down.... done
server stopped
$ pg_ctl start -D /pg8574/data # start the current instance; fails because postmaster.pid has not been removed
rase_ctl: another server might be running; trying to start server anyway
waiting for server to start....2026-03-05 18:22:35 CST::@:[196209]: FATAL:  pre-existing shared memory block (key 4097, ID 143917087) is still in use
2026-03-05 18:22:35 CST::@:[196209]: HINT:  Terminate any old server processes associated with data directory "/pg8574/data".
 stopped waiting
rase_ctl: could not start server
Examine the log output.

$ mv /lzlcloud/pg8574/data/postmaster.pid{,.bak}  # move the current instance's postmaster.pid out of the way
$ pg_ctl start -D /lzlcloud/pg8574/data  # start the current instance again; succeeds
2026-03-05 18:23:09 CST::@:[207725]: LOG:  redirecting log output to logging collector process
2026-03-05 18:23:09 CST::@:[207725]: HINT:  Future log output will appear in directory "/lzlcloud/pg8574/data/pg_log".
 done
server started

$ ipcs -m -i 143917087 # the old SysV segment is still held by the backend we stopped

Shared memory Segment shmid=143917087
uid=6001        gid=6001        cuid=6001       cgid=6001
mode=0600       access_perms=0600
bytes=56        lpid=196209     cpid=164411     nattch=1
att_time=Thu Mar  5 18:22:35 2026  
det_time=Thu Mar  5 18:22:35 2026  
change_time=Thu Mar  5 18:21:04 2026  

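The instance starts: it obtained a different shared memory segment, and the old one was not cleaned up. This is the cloud-environment trick of stopping another instance in order to start the current one.

There is one small precondition: the other instance must not only have the same inode as the current instance, its shmkey must also be lower than the current instance's.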
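Why the precondition matters can be seen in a toy model of the two checks listed earlier. This is a simplified sketch of the behavior observed in these tests, not PostgreSQL's code: `segments` maps shmkey to an (owner, nattch) pair, and `old_key` stands for the key recorded in postmaster.pid, or None when the file has been moved away. All names are hypothetical.

```python
def try_start(segments, me, inode, old_key=None):
    """Return the shmkey obtained, or None if startup fails."""
    # Check driven by postmaster.pid: the recorded segment must not be in use.
    if old_key is not None and segments.get(old_key, (None, 0))[1] > 0:
        return None
    key = inode
    while key in segments:
        owner, nattch = segments[key]
        if owner == me and nattch > 0:
            return None           # our own old segment, still attached
        if owner == me:
            del segments[key]     # our own old segment, unattached: remove, retry
            continue
        key += 1                  # another instance's segment: shmkey++
    segments[key] = (me, 1)
    return key
```

With the other instance at key 4096 and our held segment at 4097, stopping the other instance and moving postmaster.pid lets the search stop at 4096 before it ever reaches 4097. If the other instance held the higher key, the search would hit our own attached segment first and fail.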

Error analysis: `lock file "postmaster.pid" already exists`
#

This one is much simpler than the "shared memory still in use" error.

Startup always checks the lock file and the PID recorded in it, in CreateLockFile():

		if (other_pid != my_pid && other_pid != my_p_pid &&
			other_pid != my_gp_pid)
		{
			if (kill(other_pid, 0) == 0 ||
				(errno != ESRCH && errno != EPERM))
			{
				/* lockfile belongs to a live process */
				ereport(FATAL,
						(errcode(ERRCODE_LOCK_FILE_EXISTS),
						 errmsg("lock file \"%s\" already exists",
								filename),
						 isDDLock ?
						 (encoded_pid < 0 ?
						  errhint("Is another postgres (PID %d) running in data directory \"%s\"?",
								  (int) other_pid, refName) :
						  errhint("Is another postmaster (PID %d) running in data directory \"%s\"?",
								  (int) other_pid, refName)) :
						 (encoded_pid < 0 ?
						  errhint("Is another postgres (PID %d) using socket file \"%s\"?",
								  (int) other_pid, refName) :
						  errhint("Is another postmaster (PID %d) using socket file \"%s\"?",
								  (int) other_pid, refName))));
			}
		}

Testing is even simpler: start the instance again while it is already running:

$ pg_ctl start -D /pg8531/data
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2026-03-06 15:59:05 CST::@:[89145]: FATAL:  lock file "postmaster.pid" already exists
2026-03-06 15:59:05 CST::@:[89145]: HINT:  Is another postmaster (PID 255500) running in data directory "/pg8531/data"?
 stopped waiting
pg_ctl: could not start server
Examine the log output.

So the later errors in start.log during the incident appeared because the instance was already up and was started a few more times.
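The liveness test in the quoted condition, kill(other_pid, 0), can be mirrored in Python on a POSIX system. This is a sketch of that one condition only, with a made-up function name; it covers none of the surrounding lock-file handling.

```python
import errno
import os


def lockfile_owner_alive(pid):
    """Mirror of: kill(pid, 0) == 0 || (errno != ESRCH && errno != EPERM)."""
    try:
        os.kill(pid, 0)   # signal 0 delivers nothing; it only checks the target
        return True
    except OSError as e:
        # ESRCH: no such process. EPERM: a process exists but belongs to
        # someone else. The quoted condition treats both as "not live".
        return e.errno not in (errno.ESRCH, errno.EPERM)
```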

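Summary
#

At startup, PG first creates a SysV shmem segment (not the mmap segment behind shared_buffers) to lock the data directory. It uses the data directory's inode number as shmkey in shmget, which returns shmid, the segment's unique identifier. Because the segment under that key may be in use by another instance, PG keeps incrementing shmkey until it obtains a segment nobody else occupies. Line 7 of postmaster.pid stores shmkey and shmid. In cloud environments, PG instances sharing a host often show consecutive shmkeys: the data disks are mounted the same way and get the same inode, and shmkey++ does the rest.

If a PG instance is killed unexpectedly, the shmem is not cleaned up. Normally no leftover process holds the segment, so the next start removes it and proceeds. In the abnormal case a leftover process still holds it, the start fails, and manual intervention is needed.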

Recommended fixes:

ipcrm -m (most recommended)
Find the leftover process with lsof and kill it
Reboot the host
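As an alternative to lsof for the second option, the leftover processes can be found by scanning the per-process maps files for the shmid, since the smaps output earlier shows the shmid in the inode column of the SysV mapping. A minimal Python sketch; the function name and the `proc_root` parameter are illustrative, and reading other users' maps files usually requires sufficient privileges.

```python
import os


def pids_attached_to(shmid, proc_root="/proc"):
    """Return PIDs whose maps file has a SysV mapping with this shmid."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "maps")) as f:
                for line in f:
                    fields = line.split()
                    # layout: address perms offset dev inode path [(deleted)]
                    if (len(fields) >= 6 and fields[5].startswith("/SYSV")
                            and fields[4] == str(shmid)):
                        pids.append(int(entry))
                        break
        except OSError:
            continue   # process exited meanwhile, or its maps file is unreadable
    return sorted(pids)
```

The returned PIDs are the candidates to inspect and, once confirmed, terminate before starting the instance again.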

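Ways that work but are not recommended:

mv postmaster.pid + stop another PG instance (whose shmkey is lower than the current instance's)
mv postmaster.pid + remount the data disk so that the inode changes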

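Finally, answers to the opening questions:

Why is this scenario not very common in practice?

It takes an abnormal crash plus a leftover process that was not cleaned up. Often a crash leaves no such process, and a normal start just works.

start.log shows two kinds of startup errors. Which operations and logic does each correspond to?

The shared-memory-in-use error comes from an abnormal crash plus a leftover process; the postmaster.pid-exists error comes from starting an instance that was already running.

Can the shared memory still exist when the PM is gone?

Yes. The shared memory can outlive the PM, because the remaining PG processes do not necessarily exit on their own or get cleaned up by the OS. Once every process is gone, the segment remains with nattch=0, as the crash test showed, until the next start removes it.

How can this shared memory segment be located and cleaned up?

The shmid appears in the startup start.log, and ipcrm -m $shmid removes the segment.

PG has several shared memory segments. Which one is this?

It is the SysV shmem, which protects the data directory and always exists; see the "Three kinds of shared memory" section. It is distinct from the shared_buffers segment under mmap.

Can the shmem be found from an inode or a file?

Linux provides no user-space interface for finding a shmem segment from an inode or a file (this sentence is 100% AI-generated and was cross-checked across several models). PG uses the datadir inode as the seed shmkey when requesting the segment, so it does not look up the shmem by inode directly. PG has its own lookup mechanism for the segment, but the mapping is not one-to-one; shmkey++ is a compromise that lets startup proceed.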
