Analyzing a 5MB SQL That Consumed 70GB of Memory

Table of Contents

Process Memory Analysis
#

"WAL writer process (PID 66902) was terminated by signal 6: Aborted",,,,,,,,,"","postmaster"

The log shows postmaster process 66902 was killed.

Checking OS-level process memory: since top doesn’t show PPID and ps doesn’t show USS, we need both:

USER        PID   PPID PRI %CPU %MEM    VSZ   RSS WCHAN                                      S  STARTED     TIME COMMAND
postgres 211276  66478  19  8.7 10.6 57488380 56389972 -                                     R 17:13:03 00:02:47 postgres: BIND
postgres 211277  66478  19  7.8  9.6 52294700 51127480 -                                     R 17:13:03 00:02:31 postgres: BIND
postgres 222749  66478  19 22.7  9.3 51320000 49073368 -                                     R 17:35:33 00:02:09 postgres: BIND
postgres  39513  66478  19  2.9  6.8 38651084 36354736 ep_poll                               S 16:13:03 00:02:43 postgres: idle

Using PPID to identify high-memory backend processes. Let’s examine process 211276:

[postgres@lzl]$ zcat /osw/oswtop/toposw.dat.gz |grep 211276

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
211276 postgres  20   0 3271756   1.1g   1.1g S   7.3  0.2   0:03.93 postgres
211276 postgres  20   0 3291784   1.3g   1.2g R  96.4  0.2   0:11.87 postgres
211276 postgres  20   0 7369628   6.0g   2.1g R 100.0  1.2   0:46.58 postgres
211276 postgres  20   0   17.0g  15.9g   2.1g R 100.0  3.2   1:16.70 postgres
211276 postgres  20   0   28.8g  27.7g   2.1g R 100.0  5.5   1:46.82 postgres
211276 postgres  20   0   41.4g  40.4g   2.1g R 100.0  8.0   2:16.99 postgres
211276 postgres  20   0   54.7g  53.7g   2.1g R  88.8 10.7   2:47.60 postgres
211276 postgres  20   0   66.5g  64.9g   2.1g R  34.7 12.9   3:22.76 postgres
211276 postgres  20   0   71.0g  68.2g   2.1g R  99.1 13.6   3:52.94 postgres
211276 postgres  20   0   74.9g  71.2g   2.1g R 100.0 14.2   4:23.05 postgres
211276 postgres  20   0       0      0      0 R 100.0  0.0   4:45.65 postgres

We can estimate private memory via RES - SHR = USS. Process 211276’s memory ballooned from ~1GB to ~70GB within minutes, then crashed. All memory growth was private process memory.

SQL Analysis
#

The PostgreSQL log shows a 5MB SQL containing 5,000+ UNION ALLs and 30,000+ bind variables.

The execution plan is over 70,000 lines long:

Append  (cost=218196.51..218216.28 rows=1318 width=1628)
  InitPlan 1 (returns $0)
    ->  Index Scan using table1 on table1nfo  (cost=0.29..5.31 rows=1 width=40)
          Index Cond: ((col1)::text = 'xxx'::text)
          Filter: ((colcolcol)::text = 'xxx'::text)
  InitPlan 2 (returns $1)
    ->  Index Scan using table1 on table1nfo table1nfo_1  (cost=0.29..5.31 rows=1 width=40)
          Index Cond: ((col1)::text = 'xxx'::text)
          Filter: ((colcolcol)::text = 'xxx'::text)
  ...
  InitPlan 10544 (returns $10543)
    ->  Aggregate  (cost=5.58..5.59 rows=1 width=32)
          ->  Index Scan using table2 on table2col t_1317  (cost=0.56..5.58 rows=1 width=19)
                Index Cond: ((ididid)::text = 'xxx'::text)
                Filter: ((idididid)::text = '1'::text)

The plan structure is simple: ~10,000 sub-plans fetching data, then an Append to combine results.

This SQL monstrosity pushed a single backend process to 70GB. The root cause is clear: reduce the UNION ALLs and the problem goes away (which is indeed what happened). But if we dig deeper, many interesting questions arise:

Why did a 5MB SQL consume 70GB of memory?
Is the data itself related to memory usage? Was it caused by returning too many rows?
Is the memory from parsing cache or plan cache?
Why didn’t work_mem limit the operation memory, even though it’s set to a reasonable value?

Initial Analysis
#

A 5MB SQL cached in a backend would at minimum contain: metadata, parsed SQL, and plan cache information.

We’ve seen cases before where metadata cache (relcache) for hundreds of thousands of tables/partitions caused huge backend memory. But this database has few tables, so relcache can be preliminarily ruled out (later confirmed by memory dump).

Parsed SQL data shouldn’t be that large — a 5MB SQL parsed shouldn’t produce 70GB.

work_mem limitations and more:

work_mem only limits per-operation memory for sort and hash operations. This creates the “multiple sort/hash” problem: a single SQL with many sorts can use work_mem × N. PG 13 introduced hash_mem_multiplier to cap hash usage within one statement. But what about sorts? Currently no multiplier for sorts, though in practice it’s less of a problem — statements with dozens of sort nodes are rare, as they carry high cost, and the optimizer tends to place sorts late in the plan.

Here, work_mem is 128MB and the instance is PG 13+ with hash_mem_multiplier=1, so mass hash memory consumption can be ruled out. Furthermore, the execution plan above has zero sort or hash operations, confirming this is not a sort/hash issue.

So the earlier question: “Why didn’t work_mem limit operation memory?”

Because the SQL only has UNION ALL — no sort or hash operations at all. work_mem does not constrain memory here.

Other plan nodes:

No matter what, work_mem only (!) limits sort/hash. There are dozens of plan node types — are the rest all unconstrained?

Reproduction and Deep Analysis
#

Empty Table Reproduction
#

--Create empty table
 create table lzl1(col1 varchar(1));
--Query with many UNION ALLs
select col1 from lzl1 union all
select col1 from lzl1 union all 
...(5000 UNION ALLs, SQL size 150KB)
select col1 from lzl1

(Too many UNION ALLs may exceed max_stack_depth)

An empty table + many UNION ALLs immediately reproduces the memory spike. Moreover, after the SQL completes, the backend memory is reclaimed.

Since this is an empty table (0KB data file), we can rule out data as the cause. So: “Is the data itself related to memory? Was it caused by returning too many rows?” — No, data is not the main factor.

Strace System Call Analysis
#

While executing the SQL, capture system calls with strace -p:

 strace -p 198337 > strace.198337 2>&1

Quick primer on relevant Linux syscalls:

epoll_wait: Wait for an event. Idle processes sit in this state.
recvfrom: Receive a message from a socket.
getrusage: Get resource usage.
brk: Program break. Increasing it allocates memory to the process; decreasing it deallocates. malloc ultimately calls brk.
lseek: Reposition file offset.
write: Write to a file descriptor. Does not guarantee disk write.
sendto: Send a message on a socket.

Syscalls like lseek, write, sendto include fd (file descriptor) information:

lseek(37, 0, SEEK_END)                  = 0

/proc/[pid]/fd caches the process’s file descriptors. We can map an fd back to a relation — fd 37 is table lzl1:

[postgres@lzl]$ cd /proc/198337/fd
[postgres@lzl]$ ll 37
lrwx------ 1 postgres postgres 64 Jan 26 22:59 37 -> /pgdata/lzl/data13/base/16385/16386
[postgres@lzl]$ oid2name -d lzldb -f 16386
From database "lzldb":
  Filenode  Table Name
----------------------
     16386        lzl1

The strace output is dense but structurally simple:

strace: Process 198337 attached
epoll_wait(4, [{EPOLLIN, {u32=44314568, u64=44314568}}], 1, -1) = 1
## step1
recvfrom(9, "Q\0\2p\372select col1 from lzl1 union"..., 8192, 0, NULL, NULL) = 8192
recvfrom(9, " all\nselect col1 from lzl1 union"..., 8192, 0, NULL, NULL) = 8192
recvfrom(9, " all\nselect col1 from lzl1 union"..., 8192, 0, NULL, NULL) = 8192
...
recvfrom(9, " all\nselect col1 from lzl1 union"..., 8192, 0, NULL, NULL) = 8192
recvfrom(9, " all\nselect col1 from lzl1 union"..., 8192, 0, NULL, NULL) = 4347
## step2
brk(NULL)                               = 0x34d5000
brk(0x3cd5000)                          = 0x3cd5000
brk(NULL)                               = 0x3cd5000
...
brk(NULL)                               = 0x88cd6000
brk(0x894d6000)                         = 0x894d6000
## step3
lseek(37, 0, SEEK_END)                  = 0
lseek(37, 0, SEEK_END)                  = 0
...
lseek(37, 0, SEEK_END)                  = 0
## step4
brk(NULL)                               = 0x89cd6000
brk(0x8a4d6000)                         = 0x8a4d6000
brk(NULL)                               = 0x8a4d6000
...
brk(NULL)                               = 0x8a516000
brk(0x8a556000)                         = 0x8a556000
## step5
write(2, "2024-01-26 23:08:01.800 CST [198"..., 165521) = 165521
brk(NULL)                               = 0x8a556000
brk(0x8a57d000)                         = 0x8a57d000
brk(NULL)                               = 0x8a57d000
brk(0x8a59f000)                         = 0x8a59f000
...
brk(NULL)                               = 0x8d449000
brk(0x8d46b000)                         = 0x8d46b000
brk(NULL)                               = 0x8d46b000
brk(0x8d48d000)                         = 0x8d48d000
#step6
lseek(37, 0, SEEK_END)                  = 0
lseek(37, 0, SEEK_END)                  = 0
...
lseek(37, 0, SEEK_END)                  = 0
#step7
brk(NULL)                               = 0x8dcb1000
brk(NULL)                               = 0x8dcb1000
brk(0x8c179000)                         = 0x8c179000
brk(NULL)                               = 0x8c179000
brk(NULL)                               = 0x8c179000
brk(NULL)                               = 0x8c179000
brk(0x8a526000)                         = 0x8a526000
...
brk(0x34d5000)                          = 0x34d5000
brk(NULL)                               = 0x34d5000
#step8
sendto(8, "\2\0\0\0\230\0\0\0\1@\0\0\1\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 152, 0, NULL, 0) = 152
sendto(9, "T\0\0\0\35\0\1col1\0\0\0\0\0\0\0\0\0\4\23\377\377\0\0\0\5\0\0C\0"..., 50, 0, NULL, 0) = 50
#step9
recvfrom(9, 0xddcf60, 8192, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(4, strace: Process 198337 detached
 <detached ...> 

Receive the UNION ALL SQL from fd=9 socket
brk allocates memory: process memory grows from 0x34d5000 (54MB) to 0x894d6000 (2.1GB)
lseek on table lzl1
Memory grows ~4MB
write to fd=2 (log file); memory grows ~48MB
lseek on table lzl1
Memory peaks at 0x8dcb1000 (2.1GB), then brk releases memory back down to 0x34d5000 (54MB) — exactly matching the start
Send result via socket
Receive empty message from fd=9

The strace doesn’t reveal much beyond the OS allocating and releasing memory for the process.

Memory Dump Analysis
#

pmap of the process during the memory spike:

[postgres@lzl pg_log]$ pmap -x 76207
76207:   postgres: postgres lzldb [local] SELECT            
Address           Kbytes     RSS   Dirty Mode  Mapping
0000000000400000    7984    2192       0 r-x-- postgres
0000000000dcc000       4       4       4 r---- postgres
0000000000dcd000      60      60      60 rw--- postgres
0000000000ddc000     200      60      60 rw---   [ anon ]
0000000001e49000     264     224     224 rw---   [ anon ]
0000000001e8b000 1812380 1804400 1804400 rw---   [ anon ]
...
ffffffffff600000       4       0       0 r-x--   [ anon ]
---------------- ------- ------- ------- 
total kB         2089384 1810232 1807384

pmap doesn’t label the segments, but we can see the largest segment starts at address 0x1e49000. Checking smaps for more detail:

[postgres@lzl 76207]$ cat smaps |grep 1e49000 -A 30
01e49000-01e8b000 rw-p 00000000 00:00 0                                  [heap]
Size:                264 kB
...
01e8b000-70872000 rw-p 00000000 00:00 0                                  [heap]
Size:            1812380 kB
Rss:             1804400 kB
Pss:             1804400 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:   1804400 kB
Referenced:      1804400 kB
Anonymous:       1804400 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB

Heap segment. PSS (private memory): 1.8GB!

(I tried using gdb to dump the 0x1e8b000-0x70872000 segment but it failed — not sure why. Suggestions welcome!)

Using gcore for a rough dump:

[postgres@lzl lzl]$ gcore -o /pgdata/lzl/gcore.dump 76207
[postgres@lzl lzl]$ strings gcore.dump.76207> text.dump.76207
[postgres@lzl lzl]$ ll -h
-rw-r-----  1 postgres postgres 2.0G Jan 26 17:29 gcore.dump.76207
-rw-r-----  1 postgres postgres 5.2M Jan 26 17:30 text.dump.76207

2GB virtual memory allocated, 1.8GB physical memory occupied — but only 5.2MB of actual data stored!

A rough hexdump reveals many memory holes:

[postgres@lzl lzl]$ hexdump -C gcore.dump.76207 |head -10000  |grep "00 00 00 00 00 00 00 00"|wc -l
3690

log_planner_stats and Other Info
#

To verify whether the plan cache is the culprit, enable logging for parse, planner, and executor phases:

 log_parser_stats = on
 log_planner_stats = on
 log_executor_stats = on

The logs show the parse phase uses little memory, while the planner consumes significantly more.

Planner stats log:

2024-01-26 18:01:41.592 CST [208503] LOG:  PLANNER STATISTICS
2024-01-26 18:01:41.592 CST [208503] DETAIL:  ! system usage stats:
        !       0.048955 s user, 0.004996 s system, 0.054077 s elapsed
        !       [11.208034 s user, 1.313838 s system total]
        !       2255352 kB max resident size
        !       0/0 [0/352] filesystem blocks in/out
        !       0/1315 [0/563859] page faults/reclaims, 0 [0] swaps
        !       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
        !       0/0 [1/16] voluntary/involuntary context switches

2GB max resident size — consistent with the RES observed from the OS. This answers: “Is the memory from parsing cache or plan cache?” — The planner phase consumes the memory.

Inspecting TopMemoryContext
#

PostgreSQL manages backend private memory through MemoryContext. We can dump TopMemoryContext via gdb:

TopMemoryContext: 101488 total in 6 blocks; 48464 free (28 chunks); 53024 used
  pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 1408 free (0 chunks); 6784 used
  TopTransactionContext: 8192 total in 1 blocks; 7720 free (0 chunks); 472 used
  TableSpace cache: 8192 total in 1 blocks; 2048 free (0 chunks); 6144 used
  RowDescriptionContext: 8192 total in 1 blocks; 6880 free (0 chunks); 1312 used
  MessageContext: 1854981336 total in 235 blocks; 7911304 free (9 chunks); 1847070032 used
...
Grand total: 1856104056 bytes in 431 blocks; 8226712 free (179 chunks); 1847877344 used

MessageContext accounts for 1.8GB — the largest consumer.

From src/backend/utils/mmgr/README:

MessageContext — this context holds the current command message from the frontend, as well as any derived storage that need only live as long as the current message (for example, in simple-Query mode the parse and plan trees can live here). This context will be reset, and any children deleted, at the top of each cycle of the outer loop of PostgresMain. This is kept separate from per-transaction and per-portal contexts because a query string might need to live either a longer or shorter time than any single transaction or portal.

When creating a prepared statement, the parse and plan trees will be built in a temporary context that’s a child of MessageContext.

MessageContext caches messages from the frontend, including derived parse and plan tree data.
Parse and plan trees are children of MessageContext — when MessageContext is reclaimed, parse and plan trees are reclaimed too.

This also explains the private memory reclamation: the plan tree data produced during the planner phase is a child of MessageContext. Once results are returned, MessageContext is reset and all children are freed. This matches the strace observation where memory after release matches memory before allocation exactly.

Summary
#

Answering the final question: “Why did a 5MB SQL consume 70GB of memory?”

The overwhelming majority of memory was consumed during plan creation. The planner allocated enormous amounts of memory. work_mem and hash_mem_multiplier can only constrain sort and hash operations — they cannot limit other memory operations during planning. The plan tree itself isn’t that large, but the allocation process creates massive memory holes: megabyte-scale data (metadata, parse tree, plan tree, etc.) ends up stored in gigabyte-scale memory regions.

These SQL, parse tree, and plan tree structures are all cached in MessageContext and its children. Once the result is sent back to the client, all memory from this phase is reclaimed.

Process Memory Analysis#

SQL Analysis#

Initial Analysis#

Reproduction and Deep Analysis#

Empty Table Reproduction#

Strace System Call Analysis#

Memory Dump Analysis#

log_planner_stats and Other Info#

Inspecting TopMemoryContext#

Summary#

Related