Skip to main content
  1. Posts/

A Classic Case of Long Transaction, Table Bloat, and LIMIT Issues

·1309 words·7 mins
liuzhilong62
Author
liuzhilong62
PostgreSQL DBA. Writing about database internals, production cases, and source code analysis.
Table of Contents

Slow Primary Key Update — Problem Analysis
#

A simple primary key update took over 1 second to execute. Due to high concurrency, the CPU was completely maxed out:

2024-04-01 10:19:36.084 CST,"lzlopr","lzl",158751,"10.33.78.149:51502",66055a6b.26c1f,172,"UPDATE",2024-03-28 19:54:19 CST,528/19816630,970251337,LOG,00000,"duration: 1218.688 ms  plan:
Query Text:             update table_a                 set (omitted...)=$6           where column_id =$7
Update on table_a  (cost=0.40..5.49 rows=1 width=2774)
->  Index Scan using pk_id on table_a  (cost=0.40..5.49 rows=1 width=2774)
      Index Cond: ((column_id)::text = $7)",,,,,,,,,"PostgreSQL JDBC Driver","client backend"

The SQL itself is very simple — an update with a condition on the primary key. Looking at the execution plan, it used the pk_id primary key index, so there was no problem with the plan itself; the issue wasn’t a plan change.

Let’s rewrite the SQL (since it’s an UPDATE) and use explain (analyze,buffers) to compare the execution cost:

=> explain (analyze,buffers) select * from table_a where column_id='d4f713370e584820a9b15e2218ea436a';
                                                         QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on table_a  (cost=2.91..5.42 rows=1 width=1156) (actual time=55.052..123.354 rows=1 loops=1)
 Recheck Cond: ((column_id)::text = 'd4f713370e584820a9b15e2218ea436a'::text)
 Heap Blocks: exact=1
 Buffers: shared hit=13870
 ->  Bitmap Index Scan on pk_id  (cost=0.00..2.91 rows=1 width=0) (actual time=3.464..3.465 rows=13866 loops=1)
       Index Cond: ((column_id)::text = 'd4f713370e584820a9b15e2218ea436a'::text)
       Buffers: shared hit=24
Planning:
 Buffers: shared hit=4261
Planning Time: 11.028 ms
Execution Time: 123.567 ms
(11 rows)

The actual execution plan is fine, but shared hit=13870 is clearly way too high. Normally, a primary key lookup shouldn’t scan that many pages. This strongly suggests table bloat.

Checking table bloat:

-- Table size \dt
Size        | 525 MB
-- Actual row count
count | 827
-- Dead tuples from pg_stat_all_tables
n_live_tup          | 786
n_dead_tup          | 657604

Only ~800 live tuples but 650K dead tuples! This explains why the primary key scan visited so many pages. But why weren’t the dead tuples reclaimed?

When a table exceeds the default 20% modification threshold, autovacuum triggers vacuum to reclaim space. We can see in the logs that autovacuum was indeed being triggered:

2024-04-01 14:13:46.649 CST,,,14081,,660a5099.3701,1,,2024-04-01 14:13:45 CST,259/17828993,0,LOG,00000,"automatic vacuum of table ""lzl.public.table_a"": index scans: 0
2024-04-01 14:13:47.801 CST,,,14081,,660a5099.3701,2,,2024-04-01 14:13:45 CST,259/17828994,971045014,LOG,00000,"automatic analyze of table ""lzl.public.table_a"" system usage: CPU: user: 0.08 s, system: 0.01 s, elapsed: 1.15 s",,,,,,,,,"","autovacuum worker"
2024-04-01 14:14:46.673 CST,,,26136,,660a50d5.6618,1,,2024-04-01 14:14:45 CST,259/17829090,0,LOG,00000,"automatic vacuum of table ""lzl.public.table_a"": index scans: 0
2024-04-01 14:14:47.833 CST,,,26136,,660a50d5.6618,2,,2024-04-01 14:14:45 CST,259/17829091,971049759,LOG,00000,"automatic analyze of table ""lzl.public.table_a"" system usage: CPU: user: 0.08 s, system: 0.03 s, elapsed: 1.15 s",,,,,,,,,"","autovacuum worker"
2024-04-01 14:15:46.680 CST,,,40743,,660a5111.9f27,1,,2024-04-01 14:15:45 CST,259/17829164,0,LOG,00000,"automatic vacuum of table ""lzl.public.table_a"": index scans: 0
2024-04-01 14:15:47.849 CST,,,40743,,660a5111.9f27,2,,2024-04-01 14:15:45 CST,259/17829165,971055464,LOG,00000,"automatic analyze of table ""lzl.public.table_a"" system usage: CPU: user: 0.08 s, system: 0.03 s, elapsed: 1.16 s",,,,,,,,,"","autovacuum worker"
2024-04-01 14:16:46.677 CST,,,52599,,660a514d.cd77,1,,2024-04-01 14:16:45 CST,259/17829263,0,LOG,00000,"automatic vacuum of table ""lzl.public.table_a"": index scans: 0
2024-04-01 14:16:47.844 CST,,,52599,,660a514d.cd77,2,,2024-04-01 14:16:45 CST,259/17829264,971061382,LOG,00000,"automatic analyze of table ""lzl.public.table_a"" system usage: CPU: user: 0.08 s, system: 0.03 s, elapsed: 1.16 s",,,,,,,,,"","autovacuum worker"
2024-04-01 14:17:46.699 CST,,,64858,,660a5189.fd5a,1,,2024-04-01 14:17:45 CST,234/16589539,0,LOG,00000,"automatic vacuum of table ""lzl.public.table_a"": index scans: 0
2024-04-01 14:17:47.851 CST,,,64858,,660a5189.fd5a,2,,2024-04-01 14:17:45 CST,234/16589540,971066091,LOG,00000,"automatic analyze of table ""lzl.public.table_a"" system usage: CPU: user: 0.09 s, system: 0.02 s, elapsed: 1.15 s",,,,,,,,,"","autovacuum worker"
2024-04-01 14:18:46.703 CST,,,78112,,660a51c5.13120,1,,2024-04-01 14:18:45 CST,259/17829409,0,LOG,00000,"automatic vacuum of table ""lzl.public.table_a"": index scans: 0
2024-04-01 14:18:47.854 CST,,,78112,,660a51c5.13120,2,,2024-04-01 14:18:45 CST,259/17829410,971070390,LOG,00000,"automatic analyze of table ""lzl.public.table_a"" system usage: CPU: user: 0.09 s, system: 0.02 s, elapsed: 1.15 s",,,,,,,,,"","autovacuum worker"		

Not only was it triggered, but the interval was exactly 1 minute. The default autovacuum_naptime is 1 minute:

>= show autovacuum_naptime ;
autovacuum_naptime 
--------------------
1min
(1 row)

We can conclude:

  • autovacuum was successfully triggered
  • Dead tuples either couldn’t be reclaimed fast enough — the dead tuples generated within 1 minute exceeded 20% (maybe 1 minute is too long); or they weren’t being reclaimed at all, guaranteeing the next autovacuum trigger

Let’s look at the detailed autovacuum output:

2024-04-01 10:22:44.648 CST,,,16827,,660a1a73.41bb,1,,2024-04-01 10:22:43 CST,170/16910186,0,LOG,00000,"automatic vacuum of table ""lzl.public.table_a"": index scans: 0
pages: 0 removed, 48745 remain, 6 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 744488 remain, 743666 are dead but not yet removable, oldest xmin: 969118077
buffer usage: 97603 hits, 0 misses, 5 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.028 MB/s
system usage: CPU: user: 0.21 s, system: 0.22 s, elapsed: 1.41 s
WAL usage: 4 records, 3 full page images, 5129 bytes",,,,,,,,,"","autovacuum worker"

autovacuum triggered but reclaimed nothing: tuples: 0 removed, 744488 remain, 743666 are dead but not yet removable, oldest xmin: 969118077. oldest xmin represents the oldest transaction in the database — meaning there’s a long-running transaction. This is easy to find:

>=  select pid,usename,xact_start,state_change,wait_event,state,query from  pg_stat_activity where state<>'idle' order by xact_start ;
pid   |  usename   |          xact_start           |         state_change          |     wait_event      |        state        |                                                                              
--------+------------+-------------------------------+-------------------------------+---------------------+---------------------+------------------------------------------------------------------------------
164658 | phbdspsqp  | 2024-04-01 08:36:57.275408+08 | 2024-04-01 08:36:57.299609+08 | DataFileRead        | active              | SELECT "minval","maxval" FROM (select min(ID) as minval,max(TRACK

The long transaction was a SQL that had been running since around 8 AM that morning, for several hours. Even though it wasn’t on the same table, being the oldest xmin it still had an impact.

At this point the root cause is identified:

  • Table A had frequent updates, high bloat risk
  • A long transaction on table B prevented dead tuple reclamation on table A
  • Table A’s update statements scanned excessive pages

Solution:

  • Kill the long transaction: select pg_terminate_backend(164658)
  • Manually vacuum or wait 1 minute (or less) for automatic vacuum: vacuum table_a

After both steps were completed, checking dead tuples:

n_live_tup          | 707
n_dead_tup          | 298

650K dead tuples have been cleaned up.

Checking the execution plan again:

=>  explain (analyze,buffers) select * from table_a where column_id='d4f713370e584820a9b15e2218ea436a';
                                                                 QUERY PLAN                                                                    
-------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using pk_id on table_a  (cost=0.40..5.44 rows=1 width=621) (actual time=0.026..0.029 rows=1 loops=1)
 Index Cond: ((column_id)::text = 'd4f713370e584820a9b15e2218ea436a'::text)
 Buffers: shared hit=6
Planning Time: 0.057 ms
Execution Time: 0.043 ms

Shared hits down to just 6 — issue resolved.

Additionally, vacuum only reclaims dead tuples but does not shrink the table — the table remains the same size. Space can only be returned to the OS when new data reuses those pages, or through a repack/table rebuild:

Size        | 525 MB

Bonus SQL Optimization — ORDER BY LIMIT
#

That long-running transaction SQL also had its own problems… The business reported it ran fast a few days ago but took several hours today:

explain select min(ID) as minval,max(ID) as maxval from table_b where time_at >= to_timestamp('2024-03-30 00:00:00','yyyy-MM-dd HH24:mi:ss');
                                                                           QUERY PLAN                                                                             
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Result  (cost=4298.54..4298.55 rows=1 width=64)
 InitPlan 1 (returns $0)
   ->  Limit  (cost=0.70..2149.27 rows=1 width=32)
         ->  Index Scan using pk_b on table_b  (cost=0.70..1181490202.27 rows=549896 width=32)
               Index Cond: ((ID)::text IS NOT NULL)
               Filter: (time_at >= to_timestamp('2024-03-30 00:00:00'::text, 'yyyy-MM-dd HH24:mi:ss'::text))
 InitPlan 2 (returns $1)
   ->  Limit  (cost=0.70..2149.27 rows=1 width=32)
         ->  Index Scan Backward using pk_b on table_b table_b_1  (cost=0.70..1181490202.27 rows=549896 width=32)
               Index Cond: ((ID)::text IS NOT NULL)
               Filter: (time_at >= to_timestamp('2024-03-30 00:00:00'::text, 'yyyy-MM-dd HH24:mi:ss'::text))

The SQL is also simple — only one condition on a time column, with decent selectivity. However, this SQL did not use the time_at index but instead used the ID primary key index. This is the same LIMIT problem. Running ANALYZE is useless here — it’s better to rewrite the SQL.

After rewriting, the result came back instantly:

explain select min(ID||'') as minval,max(ID||'') as maxval from table_b where time_at >= to_timestamp('2024-03-30 00:00:00','yyyy-MM-dd HH24:mi:ss')

                                                       QUERY PLAN                                                         
----------------------------------------------------------------------------------------------------------------------------
Aggregate  (cost=1201418.86..1201418.87 rows=1 width=64)
 ->  Index Scan using idx_time_at on table_b  (cost=0.57..1195919.90 rows=549896 width=33)
       Index Cond: (time_at >= to_timestamp('2024-03-30 00:00:00'::text, 'yyyy-MM-dd HH24:mi:ss'::text))

This isn’t really an execution plan regression, because the plan didn’t change. A few days ago it had the same plan but ran fast — the reason is tied to data distribution and the LIMIT mechanism: when data is quickly found, it returns immediately (which is why the optimizer chose the primary key index); when it’s “unlucky” and the matching data is far away, it takes a very long time.

Summary
#

A classic case:

  • A small table with frequent updates
  • A long transaction preventing dead tuple reclamation
  • The long transaction itself was caused by an index selection problem due to sorting and LIMIT operations (ORDER BY, MAX/MIN, GROUP can all trigger this)

One incident, three classic PostgreSQL knowledge points — quite representative.

Related

Analyzing a 5MB SQL That Consumed 70GB of Memory

·2217 words·11 mins
Process Memory Analysis # "WAL writer process (PID 66902) was terminated by signal 6: Aborted",,,,,,,,,"","postmaster" The log shows postmaster process 66902 was killed. Checking OS-level process memory: since top doesn’t show PPID and ps doesn’t show USS, we need both: USER PID PPID PRI %CPU %MEM VSZ RSS WCHAN S STARTED TIME COMMAND postgres 211276 66478 19 8.7 10.6 57488380 56389972 - R 17:13:03 00:02:47 postgres: BIND postgres 211277 66478 19 7.8 9.6 52294700 51127480 - R 17:13:03 00:02:31 postgres: BIND postgres 222749 66478 19 22.7 9.3 51320000 49073368 - R 17:35:33 00:02:09 postgres: BIND postgres 39513 66478 19 2.9 6.8 38651084 36354736 ep_poll S 16:13:03 00:02:43 postgres: idle Using PPID to identify high-memory backend processes. Let’s examine process 211276:

Case Study: Analyzing Occasional Slow INSERT VALUES

·2017 words·10 mins
The business team reported that INSERT VALUES occasionally became slow. By the time I checked the active sessions, the slow write problem had already subsided. Later, I discovered that the slow write problem lasted less than half a minute, with INSERT VALUES taking 1-2 seconds. I wrote a script to capture active session information and managed to get the session data: wait_event | count ---------------------+------- [null] | 11 WALRead | 1 DataFileRead | 2 BgWriterMain | 1 WALWrite | 40 AutoVacuumMain | 1 ClientRead | 385 LogicalLauncherMain | 1 The most abnormal wait event was WALWrite with 40 sessions.

Case Study: Logical Replication Deadlocks Checkpoint, Walsender, and Backup

·2074 words·10 mins
Problem Symptoms # The backup process (pg_start_backup()) was blocked by the checkpointer, and the checkpointer was blocked by the logical replication walsender. The database was still serving queries, but backup, checkpoint, and logical replication were all completely hung. Two processes in pg_stat_activity showed an unusual wait event: replication_slot_io. [postgres@hostlzl:6666/postgres][04-08.16:50:28]=> select * from pg_stat_activity where pid=173038 \gx -[ RECORD 1 ]----+------------------------------ datid | 17630 datname | lzldb pid | 173038 usesysid | 35157 usename | repuser application_name | PostgreSQL JDBC Driver client_addr | 30.88.75.58 client_hostname | [null] client_port | 37623 backend_start | 2024-04-02 11:40:07.75022+08 xact_start | [null] query_start | [null] state_change | 2024-04-02 11:40:07.764475+08 wait_event_type | LWLock wait_event | replication_slot_io state | active backend_xid | [null] backend_xmin | [null] query | backend_type | walsender Time: 6.658 ms [postgres@hostlzl:6666/postgres][04-08.16:50:34]=> select * from pg_stat_activity where pid=12729\gx -[ RECORD 1 ]----+------------------------------ datid | [null] datname | [null] pid | 12729 usesysid | [null] usename | [null] application_name | client_addr | [null] client_hostname | [null] client_port | [null] backend_start | 2024-04-02 11:23:03.343116+08 xact_start | [null] query_start | [null] state_change | [null] wait_event_type | LWLock wait_event | replication_slot_io state | [null] backend_xid | [null] backend_xmin | [null] query | backend_type | checkpointer One walsender and one checkpointer. Both were started on April 2. Let’s check the walsender 173038 logs: