Table of Contents

Slow Primary Key Update — Problem Analysis
#

A simple primary key update took over 1 second to execute. Due to high concurrency, the CPU was completely maxed out:

2024-04-01 10:19:36.084 CST,"lzlopr","lzl",158751,"10.33.78.149:51502",66055a6b.26c1f,172,"UPDATE",2024-03-28 19:54:19 CST,528/19816630,970251337,LOG,00000,"duration: 1218.688 ms  plan:
Query Text:             update table_a                 set (omitted...）=$6           where column_id =$7
Update on table_a  (cost=0.40..5.49 rows=1 width=2774)
->  Index Scan using pk_id on table_a  (cost=0.40..5.49 rows=1 width=2774)
      Index Cond: ((column_id)::text = $7)",,,,,,,,,"PostgreSQL JDBC Driver","client backend"

The SQL itself is very simple — an update with a condition on the primary key. Looking at the execution plan, it used the pk_id primary key index, so there was no problem with the plan itself; the issue wasn’t a plan change.

Let’s rewrite the SQL (since it’s an UPDATE) and use explain (analyze,buffers) to compare the execution cost:

=> explain (analyze,buffers) select * from table_a where column_id='d4f713370e584820a9b15e2218ea436a';
                                                         QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on table_a  (cost=2.91..5.42 rows=1 width=1156) (actual time=55.052..123.354 rows=1 loops=1)
 Recheck Cond: ((column_id)::text = 'd4f713370e584820a9b15e2218ea436a'::text)
 Heap Blocks: exact=1
 Buffers: shared hit=13870
 ->  Bitmap Index Scan on pk_id  (cost=0.00..2.91 rows=1 width=0) (actual time=3.464..3.465 rows=13866 loops=1)
       Index Cond: ((column_id)::text = 'd4f713370e584820a9b15e2218ea436a'::text)
       Buffers: shared hit=24
Planning:
 Buffers: shared hit=4261
Planning Time: 11.028 ms
Execution Time: 123.567 ms
(11 rows)

The actual execution plan is fine, but shared hit=13870 is clearly way too high. Normally, a primary key lookup shouldn’t scan that many pages. This strongly suggests table bloat.

Checking table bloat:

-- Table size \dt
Size        | 525 MB

-- Actual row count
count | 827

-- Dead tuples from pg_stat_all_tables
n_live_tup          | 786
n_dead_tup          | 657604

Only ~800 live tuples but 650K dead tuples! This explains why the primary key scan visited so many pages. But why weren’t the dead tuples reclaimed?

When a table exceeds the default 20% modification threshold, autovacuum triggers vacuum to reclaim space. We can see in the logs that autovacuum was indeed being triggered:

2024-04-01 14:13:46.649 CST,,,14081,,660a5099.3701,1,,2024-04-01 14:13:45 CST,259/17828993,0,LOG,00000,"automatic vacuum of table ""lzl.public.table_a"": index scans: 0
2024-04-01 14:13:47.801 CST,,,14081,,660a5099.3701,2,,2024-04-01 14:13:45 CST,259/17828994,971045014,LOG,00000,"automatic analyze of table ""lzl.public.table_a"" system usage: CPU: user: 0.08 s, system: 0.01 s, elapsed: 1.15 s",,,,,,,,,"","autovacuum worker"
2024-04-01 14:14:46.673 CST,,,26136,,660a50d5.6618,1,,2024-04-01 14:14:45 CST,259/17829090,0,LOG,00000,"automatic vacuum of table ""lzl.public.table_a"": index scans: 0
2024-04-01 14:14:47.833 CST,,,26136,,660a50d5.6618,2,,2024-04-01 14:14:45 CST,259/17829091,971049759,LOG,00000,"automatic analyze of table ""lzl.public.table_a"" system usage: CPU: user: 0.08 s, system: 0.03 s, elapsed: 1.15 s",,,,,,,,,"","autovacuum worker"
2024-04-01 14:15:46.680 CST,,,40743,,660a5111.9f27,1,,2024-04-01 14:15:45 CST,259/17829164,0,LOG,00000,"automatic vacuum of table ""lzl.public.table_a"": index scans: 0
2024-04-01 14:15:47.849 CST,,,40743,,660a5111.9f27,2,,2024-04-01 14:15:45 CST,259/17829165,971055464,LOG,00000,"automatic analyze of table ""lzl.public.table_a"" system usage: CPU: user: 0.08 s, system: 0.03 s, elapsed: 1.16 s",,,,,,,,,"","autovacuum worker"
2024-04-01 14:16:46.677 CST,,,52599,,660a514d.cd77,1,,2024-04-01 14:16:45 CST,259/17829263,0,LOG,00000,"automatic vacuum of table ""lzl.public.table_a"": index scans: 0
2024-04-01 14:16:47.844 CST,,,52599,,660a514d.cd77,2,,2024-04-01 14:16:45 CST,259/17829264,971061382,LOG,00000,"automatic analyze of table ""lzl.public.table_a"" system usage: CPU: user: 0.08 s, system: 0.03 s, elapsed: 1.16 s",,,,,,,,,"","autovacuum worker"
2024-04-01 14:17:46.699 CST,,,64858,,660a5189.fd5a,1,,2024-04-01 14:17:45 CST,234/16589539,0,LOG,00000,"automatic vacuum of table ""lzl.public.table_a"": index scans: 0
2024-04-01 14:17:47.851 CST,,,64858,,660a5189.fd5a,2,,2024-04-01 14:17:45 CST,234/16589540,971066091,LOG,00000,"automatic analyze of table ""lzl.public.table_a"" system usage: CPU: user: 0.09 s, system: 0.02 s, elapsed: 1.15 s",,,,,,,,,"","autovacuum worker"
2024-04-01 14:18:46.703 CST,,,78112,,660a51c5.13120,1,,2024-04-01 14:18:45 CST,259/17829409,0,LOG,00000,"automatic vacuum of table ""lzl.public.table_a"": index scans: 0
2024-04-01 14:18:47.854 CST,,,78112,,660a51c5.13120,2,,2024-04-01 14:18:45 CST,259/17829410,971070390,LOG,00000,"automatic analyze of table ""lzl.public.table_a"" system usage: CPU: user: 0.09 s, system: 0.02 s, elapsed: 1.15 s",,,,,,,,,"","autovacuum worker"		

Not only was it triggered, but the interval was exactly 1 minute. The default autovacuum_naptime is 1 minute:

>= show autovacuum_naptime ;
autovacuum_naptime 
--------------------
1min
(1 row)

We can conclude:

autovacuum was successfully triggered
Dead tuples either couldn’t be reclaimed fast enough — the dead tuples generated within 1 minute exceeded 20% (maybe 1 minute is too long); or they weren’t being reclaimed at all, guaranteeing the next autovacuum trigger

Let’s look at the detailed autovacuum output:

2024-04-01 10:22:44.648 CST,,,16827,,660a1a73.41bb,1,,2024-04-01 10:22:43 CST,170/16910186,0,LOG,00000,"automatic vacuum of table ""lzl.public.table_a"": index scans: 0
pages: 0 removed, 48745 remain, 6 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 744488 remain, 743666 are dead but not yet removable, oldest xmin: 969118077
buffer usage: 97603 hits, 0 misses, 5 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.028 MB/s
system usage: CPU: user: 0.21 s, system: 0.22 s, elapsed: 1.41 s
WAL usage: 4 records, 3 full page images, 5129 bytes",,,,,,,,,"","autovacuum worker"

autovacuum triggered but reclaimed nothing: tuples: 0 removed, 744488 remain, 743666 are dead but not yet removable, oldest xmin: 969118077. oldest xmin represents the oldest transaction in the database — meaning there’s a long-running transaction. This is easy to find:

>=  select pid,usename,xact_start,state_change,wait_event,state,query from  pg_stat_activity where state<>'idle' order by xact_start ;
pid   |  usename   |          xact_start           |         state_change          |     wait_event      |        state        |                                                                              
--------+------------+-------------------------------+-------------------------------+---------------------+---------------------+------------------------------------------------------------------------------
164658 | phbdspsqp  | 2024-04-01 08:36:57.275408+08 | 2024-04-01 08:36:57.299609+08 | DataFileRead        | active              | SELECT "minval","maxval" FROM (select min(ID) as minval,max(TRACK

The long transaction was a SQL that had been running since around 8 AM that morning, for several hours. Even though it wasn’t on the same table, being the oldest xmin it still had an impact.

At this point the root cause is identified:

Table A had frequent updates, high bloat risk
A long transaction on table B prevented dead tuple reclamation on table A
Table A’s update statements scanned excessive pages

Solution:

Kill the long transaction: select pg_terminate_backend(164658)
Manually vacuum or wait 1 minute (or less) for automatic vacuum: vacuum table_a

After both steps were completed, checking dead tuples:

n_live_tup          | 707
n_dead_tup          | 298

650K dead tuples have been cleaned up.

Checking the execution plan again:

=>  explain (analyze,buffers) select * from table_a where column_id='d4f713370e584820a9b15e2218ea436a';
                                                                 QUERY PLAN                                                                    
-------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using pk_id on table_a  (cost=0.40..5.44 rows=1 width=621) (actual time=0.026..0.029 rows=1 loops=1)
 Index Cond: ((column_id)::text = 'd4f713370e584820a9b15e2218ea436a'::text)
 Buffers: shared hit=6
Planning Time: 0.057 ms
Execution Time: 0.043 ms

Shared hits down to just 6 — issue resolved.

Additionally, vacuum only reclaims dead tuples but does not shrink the table — the table remains the same size. Space can only be returned to the OS when new data reuses those pages, or through a repack/table rebuild:

Size        | 525 MB

Bonus SQL Optimization — ORDER BY LIMIT
#

That long-running transaction SQL also had its own problems… The business reported it ran fast a few days ago but took several hours today:

explain select min(ID) as minval,max(ID) as maxval from table_b where time_at >= to_timestamp('2024-03-30 00:00:00','yyyy-MM-dd HH24:mi:ss');
                                                                           QUERY PLAN                                                                             
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Result  (cost=4298.54..4298.55 rows=1 width=64)
 InitPlan 1 (returns $0)
   ->  Limit  (cost=0.70..2149.27 rows=1 width=32)
         ->  Index Scan using pk_b on table_b  (cost=0.70..1181490202.27 rows=549896 width=32)
               Index Cond: ((ID)::text IS NOT NULL)
               Filter: (time_at >= to_timestamp('2024-03-30 00:00:00'::text, 'yyyy-MM-dd HH24:mi:ss'::text))
 InitPlan 2 (returns $1)
   ->  Limit  (cost=0.70..2149.27 rows=1 width=32)
         ->  Index Scan Backward using pk_b on table_b table_b_1  (cost=0.70..1181490202.27 rows=549896 width=32)
               Index Cond: ((ID)::text IS NOT NULL)
               Filter: (time_at >= to_timestamp('2024-03-30 00:00:00'::text, 'yyyy-MM-dd HH24:mi:ss'::text))

The SQL is also simple — only one condition on a time column, with decent selectivity. However, this SQL did not use the time_at index but instead used the ID primary key index. This is the same LIMIT problem. Running ANALYZE is useless here — it’s better to rewrite the SQL.

After rewriting, the result came back instantly:

explain select min(ID||'') as minval,max(ID||'') as maxval from table_b where time_at >= to_timestamp('2024-03-30 00:00:00','yyyy-MM-dd HH24:mi:ss')

                                                       QUERY PLAN                                                         
----------------------------------------------------------------------------------------------------------------------------
Aggregate  (cost=1201418.86..1201418.87 rows=1 width=64)
 ->  Index Scan using idx_time_at on table_b  (cost=0.57..1195919.90 rows=549896 width=33)
       Index Cond: (time_at >= to_timestamp('2024-03-30 00:00:00'::text, 'yyyy-MM-dd HH24:mi:ss'::text))

This isn’t really an execution plan regression, because the plan didn’t change. A few days ago it had the same plan but ran fast — the reason is tied to data distribution and the LIMIT mechanism: when data is quickly found, it returns immediately (which is why the optimizer chose the primary key index); when it’s “unlucky” and the matching data is far away, it takes a very long time.

Summary
#

A classic case:

A small table with frequent updates
A long transaction preventing dead tuple reclamation
The long transaction itself was caused by an index selection problem due to sorting and LIMIT operations (ORDER BY, MAX/MIN, GROUP can all trigger this)

One incident, three classic PostgreSQL knowledge points — quite representative.

A Classic Case of Long Transaction, Table Bloat, and LIMIT Issues

Slow Primary Key Update — Problem Analysis
#

Bonus SQL Optimization — ORDER BY LIMIT
#

Summary
#

Related

Slow Primary Key Update — Problem Analysis#

Bonus SQL Optimization — ORDER BY LIMIT#

Summary#

Related

Slow Primary Key Update — Problem Analysis
#

Bonus SQL Optimization — ORDER BY LIMIT
#

Summary
#