Slow Primary Key Update — Problem Analysis#
A simple primary key update took over 1 second to execute. Due to high concurrency, the CPU was completely maxed out:
2024-04-01 10:19:36.084 CST,"lzlopr","lzl",158751,"10.33.78.149:51502",66055a6b.26c1f,172,"UPDATE",2024-03-28 19:54:19 CST,528/19816630,970251337,LOG,00000,"duration: 1218.688 ms plan:
Query Text: update table_a set (omitted...)=$6 where column_id =$7
Update on table_a (cost=0.40..5.49 rows=1 width=2774)
-> Index Scan using pk_id on table_a (cost=0.40..5.49 rows=1 width=2774)
Index Cond: ((column_id)::text = $7)",,,,,,,,,"PostgreSQL JDBC Driver","client backend"The SQL itself is very simple — an update with a condition on the primary key. Looking at the execution plan, it used the pk_id primary key index, so there was no problem with the plan itself; the issue wasn’t a plan change.
Let’s rewrite the SQL (since it’s an UPDATE) and use explain (analyze,buffers) to compare the execution cost:
=> explain (analyze,buffers) select * from table_a where column_id='d4f713370e584820a9b15e2218ea436a';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on table_a (cost=2.91..5.42 rows=1 width=1156) (actual time=55.052..123.354 rows=1 loops=1)
Recheck Cond: ((column_id)::text = 'd4f713370e584820a9b15e2218ea436a'::text)
Heap Blocks: exact=1
Buffers: shared hit=13870
-> Bitmap Index Scan on pk_id (cost=0.00..2.91 rows=1 width=0) (actual time=3.464..3.465 rows=13866 loops=1)
Index Cond: ((column_id)::text = 'd4f713370e584820a9b15e2218ea436a'::text)
Buffers: shared hit=24
Planning:
Buffers: shared hit=4261
Planning Time: 11.028 ms
Execution Time: 123.567 ms
(11 rows)The actual execution plan is fine, but shared hit=13870 is clearly way too high. Normally, a primary key lookup shouldn’t scan that many pages. This strongly suggests table bloat.
Checking table bloat:
-- Table size \dt
Size | 525 MB-- Actual row count
count | 827-- Dead tuples from pg_stat_all_tables
n_live_tup | 786
n_dead_tup | 657604Only ~800 live tuples but 650K dead tuples! This explains why the primary key scan visited so many pages. But why weren’t the dead tuples reclaimed?
When a table exceeds the default 20% modification threshold, autovacuum triggers vacuum to reclaim space. We can see in the logs that autovacuum was indeed being triggered:
2024-04-01 14:13:46.649 CST,,,14081,,660a5099.3701,1,,2024-04-01 14:13:45 CST,259/17828993,0,LOG,00000,"automatic vacuum of table ""lzl.public.table_a"": index scans: 0
2024-04-01 14:13:47.801 CST,,,14081,,660a5099.3701,2,,2024-04-01 14:13:45 CST,259/17828994,971045014,LOG,00000,"automatic analyze of table ""lzl.public.table_a"" system usage: CPU: user: 0.08 s, system: 0.01 s, elapsed: 1.15 s",,,,,,,,,"","autovacuum worker"
2024-04-01 14:14:46.673 CST,,,26136,,660a50d5.6618,1,,2024-04-01 14:14:45 CST,259/17829090,0,LOG,00000,"automatic vacuum of table ""lzl.public.table_a"": index scans: 0
2024-04-01 14:14:47.833 CST,,,26136,,660a50d5.6618,2,,2024-04-01 14:14:45 CST,259/17829091,971049759,LOG,00000,"automatic analyze of table ""lzl.public.table_a"" system usage: CPU: user: 0.08 s, system: 0.03 s, elapsed: 1.15 s",,,,,,,,,"","autovacuum worker"
2024-04-01 14:15:46.680 CST,,,40743,,660a5111.9f27,1,,2024-04-01 14:15:45 CST,259/17829164,0,LOG,00000,"automatic vacuum of table ""lzl.public.table_a"": index scans: 0
2024-04-01 14:15:47.849 CST,,,40743,,660a5111.9f27,2,,2024-04-01 14:15:45 CST,259/17829165,971055464,LOG,00000,"automatic analyze of table ""lzl.public.table_a"" system usage: CPU: user: 0.08 s, system: 0.03 s, elapsed: 1.16 s",,,,,,,,,"","autovacuum worker"
2024-04-01 14:16:46.677 CST,,,52599,,660a514d.cd77,1,,2024-04-01 14:16:45 CST,259/17829263,0,LOG,00000,"automatic vacuum of table ""lzl.public.table_a"": index scans: 0
2024-04-01 14:16:47.844 CST,,,52599,,660a514d.cd77,2,,2024-04-01 14:16:45 CST,259/17829264,971061382,LOG,00000,"automatic analyze of table ""lzl.public.table_a"" system usage: CPU: user: 0.08 s, system: 0.03 s, elapsed: 1.16 s",,,,,,,,,"","autovacuum worker"
2024-04-01 14:17:46.699 CST,,,64858,,660a5189.fd5a,1,,2024-04-01 14:17:45 CST,234/16589539,0,LOG,00000,"automatic vacuum of table ""lzl.public.table_a"": index scans: 0
2024-04-01 14:17:47.851 CST,,,64858,,660a5189.fd5a,2,,2024-04-01 14:17:45 CST,234/16589540,971066091,LOG,00000,"automatic analyze of table ""lzl.public.table_a"" system usage: CPU: user: 0.09 s, system: 0.02 s, elapsed: 1.15 s",,,,,,,,,"","autovacuum worker"
2024-04-01 14:18:46.703 CST,,,78112,,660a51c5.13120,1,,2024-04-01 14:18:45 CST,259/17829409,0,LOG,00000,"automatic vacuum of table ""lzl.public.table_a"": index scans: 0
2024-04-01 14:18:47.854 CST,,,78112,,660a51c5.13120,2,,2024-04-01 14:18:45 CST,259/17829410,971070390,LOG,00000,"automatic analyze of table ""lzl.public.table_a"" system usage: CPU: user: 0.09 s, system: 0.02 s, elapsed: 1.15 s",,,,,,,,,"","autovacuum worker" Not only was it triggered, but the interval was exactly 1 minute. The default autovacuum_naptime is 1 minute:
>= show autovacuum_naptime ;
autovacuum_naptime
--------------------
1min
(1 row)We can conclude:
- autovacuum was successfully triggered
- Dead tuples either couldn’t be reclaimed fast enough — the dead tuples generated within 1 minute exceeded 20% (maybe 1 minute is too long); or they weren’t being reclaimed at all, guaranteeing the next autovacuum trigger
Let’s look at the detailed autovacuum output:
2024-04-01 10:22:44.648 CST,,,16827,,660a1a73.41bb,1,,2024-04-01 10:22:43 CST,170/16910186,0,LOG,00000,"automatic vacuum of table ""lzl.public.table_a"": index scans: 0
pages: 0 removed, 48745 remain, 6 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 744488 remain, 743666 are dead but not yet removable, oldest xmin: 969118077
buffer usage: 97603 hits, 0 misses, 5 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.028 MB/s
system usage: CPU: user: 0.21 s, system: 0.22 s, elapsed: 1.41 s
WAL usage: 4 records, 3 full page images, 5129 bytes",,,,,,,,,"","autovacuum worker"autovacuum triggered but reclaimed nothing: tuples: 0 removed, 744488 remain, 743666 are dead but not yet removable, oldest xmin: 969118077. oldest xmin represents the oldest transaction in the database — meaning there’s a long-running transaction. This is easy to find:
>= select pid,usename,xact_start,state_change,wait_event,state,query from pg_stat_activity where state<>'idle' order by xact_start ;
pid | usename | xact_start | state_change | wait_event | state |
--------+------------+-------------------------------+-------------------------------+---------------------+---------------------+------------------------------------------------------------------------------
164658 | phbdspsqp | 2024-04-01 08:36:57.275408+08 | 2024-04-01 08:36:57.299609+08 | DataFileRead | active | SELECT "minval","maxval" FROM (select min(ID) as minval,max(TRACKThe long transaction was a SQL that had been running since around 8 AM that morning, for several hours. Even though it wasn’t on the same table, being the oldest xmin it still had an impact.
At this point the root cause is identified:
- Table A had frequent updates, high bloat risk
- A long transaction on table B prevented dead tuple reclamation on table A
- Table A’s update statements scanned excessive pages
Solution:
- Kill the long transaction:
select pg_terminate_backend(164658) - Manually vacuum or wait 1 minute (or less) for automatic vacuum:
vacuum table_a
After both steps were completed, checking dead tuples:
n_live_tup | 707
n_dead_tup | 298650K dead tuples have been cleaned up.
Checking the execution plan again:
=> explain (analyze,buffers) select * from table_a where column_id='d4f713370e584820a9b15e2218ea436a';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using pk_id on table_a (cost=0.40..5.44 rows=1 width=621) (actual time=0.026..0.029 rows=1 loops=1)
Index Cond: ((column_id)::text = 'd4f713370e584820a9b15e2218ea436a'::text)
Buffers: shared hit=6
Planning Time: 0.057 ms
Execution Time: 0.043 msShared hits down to just 6 — issue resolved.
Additionally, vacuum only reclaims dead tuples but does not shrink the table — the table remains the same size. Space can only be returned to the OS when new data reuses those pages, or through a repack/table rebuild:
Size | 525 MBBonus SQL Optimization — ORDER BY LIMIT#
That long-running transaction SQL also had its own problems… The business reported it ran fast a few days ago but took several hours today:
explain select min(ID) as minval,max(ID) as maxval from table_b where time_at >= to_timestamp('2024-03-30 00:00:00','yyyy-MM-dd HH24:mi:ss');
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Result (cost=4298.54..4298.55 rows=1 width=64)
InitPlan 1 (returns $0)
-> Limit (cost=0.70..2149.27 rows=1 width=32)
-> Index Scan using pk_b on table_b (cost=0.70..1181490202.27 rows=549896 width=32)
Index Cond: ((ID)::text IS NOT NULL)
Filter: (time_at >= to_timestamp('2024-03-30 00:00:00'::text, 'yyyy-MM-dd HH24:mi:ss'::text))
InitPlan 2 (returns $1)
-> Limit (cost=0.70..2149.27 rows=1 width=32)
-> Index Scan Backward using pk_b on table_b table_b_1 (cost=0.70..1181490202.27 rows=549896 width=32)
Index Cond: ((ID)::text IS NOT NULL)
Filter: (time_at >= to_timestamp('2024-03-30 00:00:00'::text, 'yyyy-MM-dd HH24:mi:ss'::text))The SQL is also simple — only one condition on a time column, with decent selectivity.
However, this SQL did not use the time_at index but instead used the ID primary key index. This is the same LIMIT problem. Running ANALYZE is useless here — it’s better to rewrite the SQL.
After rewriting, the result came back instantly:
explain select min(ID||'') as minval,max(ID||'') as maxval from table_b where time_at >= to_timestamp('2024-03-30 00:00:00','yyyy-MM-dd HH24:mi:ss')
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1201418.86..1201418.87 rows=1 width=64)
-> Index Scan using idx_time_at on table_b (cost=0.57..1195919.90 rows=549896 width=33)
Index Cond: (time_at >= to_timestamp('2024-03-30 00:00:00'::text, 'yyyy-MM-dd HH24:mi:ss'::text))This isn’t really an execution plan regression, because the plan didn’t change. A few days ago it had the same plan but ran fast — the reason is tied to data distribution and the LIMIT mechanism: when data is quickly found, it returns immediately (which is why the optimizer chose the primary key index); when it’s “unlucky” and the matching data is far away, it takes a very long time.
Summary#
A classic case:
- A small table with frequent updates
- A long transaction preventing dead tuple reclamation
- The long transaction itself was caused by an index selection problem due to sorting and LIMIT operations (ORDER BY, MAX/MIN, GROUP can all trigger this)
One incident, three classic PostgreSQL knowledge points — quite representative.