Skip to main content
  1. Posts/

Why Is 'partition of' Slow When There's No Blocking?

·1015 words·5 mins
liuzhilong62
Author
liuzhilong62
PostgreSQL DBA. Writing about database internals, production cases, and source code analysis.
Table of Contents

Analyzing Slow CREATE TABLE.. PARTITION OF Statements
#

2024-05-16 22:02:59.063 CST,"user1","dblzl",125889,"30.88.79.3:37423",66461213.1ebc1,2,"authentication",2024-05-16 22:02:59 CST,34/41364668,0,LOG,00000,"connection authorized: user=user1 database=dblzl",,,,,,,,,"","client backend"
2024-05-16 22:02:59.079 CST,"user1","dblzl",125889,"30.88.79.3:37423",66461213.1ebc1,3,"idle",2024-05-16 22:02:59 CST,34/41364669,0,LOG,00000,"statement:  -- a86fae372f73414bbe1af18213a47beb
/*a86fae372f73414bbe1af18213a47beb */
create table if not exists table1_partition_p2406 partition of table1 for values from ('2024-06-01 00:00:00') to ('2024-07-01 00:00:00'); ",,,,,,,,,"","client backend"
...
2024-05-16 22:38:28.555 CST,"user1","dblzl",125889,"30.88.79.3:37423",66461213.1ebc1,4,"CREATE TABLE",2024-05-16 22:02:59 CST,34/0,0,LOG,00000,"duration: 2129483.549 ms",,,,,,,,,"","client backend"

The user ‘user1’ connected to the database at 22:02:59 and immediately executed a create table.. partition of.. statement, which didn’t complete until 22:38:28. The logs in between are omitted — there was a lot of session blocking information, with session 125889 as the blocking source.

Blocked sessions looked like:

process 33569 still waiting for RowExclusiveLock on relation 53733 of database 17073 after 1000.048 ms","Process holding the lock: 125889. Wait queue: 33569.

When PARTITION OF adds a partition, it acquires an AccessExclusiveLock (level 8) on the parent table, which blocks all operations on the partitioned table. Normally, adding a partition via PARTITION OF is very fast, and the lock is released immediately. However, if there’s a long-running transaction on the partitioned table, the level 8 lock on the parent table must wait, causing subsequent blocking.

(Stolen from my own diagram): diagram

However, in this case there was no long transaction on the table, yet PARTITION OF took 35 minutes.

From historical process information, this process was in D state (uninterruptible sleep), which was suspicious. Initially, I suspected memory or disk issues, but after investigation, everything was normal.

However, this problem was easy to reproduce — running create table partition of directly in a simulation environment was very slow. pg_stat_activity showed the statement waiting on IO:

wait_event_type  | IO
wait_event       | DataFileRead
state            | active
query            | create table xxx partition of xx for values from ('2025-05-01 00:00:00') to ('2025-06-01 00:00:00');

strace tracing revealed the process was heavily reading one file:

pread64(53, "\22\2\0\0\220w\321>\0\0\5\0\24\0018\1\0 \4 \0\0\0\0\200\237\0\1\310\236p\1"..., 8192, 863485952) = 8192

Using file descriptor 53, we identified the file:

[/proc/356174/fd] ll |grep 53
lrwx------ 1 postgres postgres 64 May 17 15:34 53 -> /lzl/pglzl/data/base/17076/25883
oid2name -d lzldb  -f 25883
From database "lzldb":
  Filenode                           Table Name
-----------------------------------------------
     25883  table_partition_default

Finally located: the table table_partition_default:

=> \d+ table_partition_default
...
Partition of: table_partition_default DEFAULT
Partition constraint: (NOT ((date_created IS NOT NULL) AND ((date_created < '2022-05-01 00:00:00'::timestamp without time zone) OR ((date_created >= '2022-05-01 00:00:00'::timestamp without time zone) AND (da


=> \dt+ table_partition_default
                                          List of relations
 Schema |                Name             | Type  | Owner | Persistence | Size  | Description 
--------+------------------------------------+-------+------------+-------------+-------+-------------
 public | table_partition_default         | table | user1 | permanent   | 50 GB | 
(1 row)

It was the default partition table, with tens of GB of data. Oracle DBAs might find this unfamiliar — PG’s default partition receives data that doesn’t fall into any defined partition range. The default partition ensures data is still accepted even if no matching range is defined.

If data exists in the default partition and a new partition needs to cover that range, what happens? It directly throws an error:

=> create table if not exists table_partition_pxxxx partition of table_partition for values from ('2023-01-12 00:00:00') to ('2023-01-13 00:00:00');
ERROR:  23514: updated partition constraint for default partition "table_partition_default" would be violated by some row
SCHEMA NAME:  public
TABLE NAME:  table_partition_default
LOCATION:  check_default_partition_contents, partbounds.c:3227

As you can see, when adding a child partition, the default partition’s partition constraint is automatically modified. The default partition constraint check is essentially validating the default partition’s data against the new partition’s range.

At this point, the cause is clear:

When adding a new child partition to a partitioned table, the partition creation statement needs to validate data in the default partition to ensure the new partition’s data range doesn’t conflict with existing default partition data. This caused CREATE TABLE PARTITION OF to read a massive amount of default partition data, preventing the new partition from being created. The blocking then cascaded, making business data unqueryable and unwritable.

Summary and Recommendations
#

PostgreSQL partitioned tables are becoming increasingly common. Maintaining partitions requires attention to many details. I recommend reading PostgreSQL Partitioned Tables, which covers almost everything.

In this case, the key to resolution is the data in the default partition. Before refactoring the default partition, do not use PARTITION OF to create child partitions.

Default partition refactoring plan:

  1. Detach the default child partition, then properly create child partitions, and reinsert the default table data back into the partitioned table.
  2. If necessary, after detaching and creating proper child partitions, create an empty default partition to maintain business data continuity.
  3. Note that detach differs from attach — detach requires a level 8 lock on the parent table. PG14 supports DETACH CONCURRENTLY.

If you don’t refactor the default partition, check the current data range in the default partition. Using ATTACH to add child partitions will be slow, but won’t block reads and writes.

Finally, a review of best practices for adding partitions:

PARTITION OF requires a level 8 lock on the parent table, which carries risk. The recommended approach is to use ATTACH to add new child partitions (partition indexes can be handled similarly). This does not block reads and writes, has no business impact, and can be done online.

The correct approach for adding new partitions:

CREATE TABLE lzlpartition1_202303
  (LIKE lzlpartition1 INCLUDING DEFAULTS INCLUDING CONSTRAINTS);
alter table LZLPARTITION1 attach partition LZLPARTITION1_202303 for values from ('2023-03-01 00:00:00') to ('2023-04-01 00:00:00');

If the new partition already has data, ATTACH may still be slow. You can optimize by pre-creating constraints:

The correct approach for adding a partition that already has data:

-- Reduce verbose DDL by using LIKE
CREATE TABLE lzlpartition1_202303
  (LIKE lzlpartition1 INCLUDING DEFAULTS INCLUDING CONSTRAINTS);
-- Skip this step if no data exists. Add a CHECK constraint referencing other partitions' Partition constraint to reduce ATTACH constraint validation time.
alter table lzlpartition1_202303 add constraint chk_202303 CHECK ((date_created IS NOT NULL) AND (date_created >= '2023-03-01 00:00:00'::timestamp without time zone) AND (date_created < '2023-04-01 00:00:00'::timestamp without time zone));
-- Add partition via ATTACH
alter table LZLPARTITION1 attach partition LZLPARTITION1_202303 for values from ('2023-03-01 00:00:00') to ('2023-04-01 00:00:00');
-- Optional. Before transactions occur on the new partition, drop the extra CHECK constraint
alter table lzlpartition1_202303 drop constraint  chk_202303;

Related

A Classic Case of Long Transaction, Table Bloat, and LIMIT Issues

·1309 words·7 mins
Slow Primary Key Update — Problem Analysis # A simple primary key update took over 1 second to execute. Due to high concurrency, the CPU was completely maxed out: 2024-04-01 10:19:36.084 CST,"lzlopr","lzl",158751,"10.33.78.149:51502",66055a6b.26c1f,172,"UPDATE",2024-03-28 19:54:19 CST,528/19816630,970251337,LOG,00000,"duration: 1218.688 ms plan: Query Text: update table_a set (omitted...)=$6 where column_id =$7 Update on table_a (cost=0.40..5.49 rows=1 width=2774) -> Index Scan using pk_id on table_a (cost=0.40..5.49 rows=1 width=2774) Index Cond: ((column_id)::text = $7)",,,,,,,,,"PostgreSQL JDBC Driver","client backend" The SQL itself is very simple — an update with a condition on the primary key. Looking at the execution plan, it used the pk_id primary key index, so there was no problem with the plan itself; the issue wasn’t a plan change.

Analyzing a 5MB SQL That Consumed 70GB of Memory

·2217 words·11 mins
Process Memory Analysis # "WAL writer process (PID 66902) was terminated by signal 6: Aborted",,,,,,,,,"","postmaster" The log shows postmaster process 66902 was killed. Checking OS-level process memory: since top doesn’t show PPID and ps doesn’t show USS, we need both: USER PID PPID PRI %CPU %MEM VSZ RSS WCHAN S STARTED TIME COMMAND postgres 211276 66478 19 8.7 10.6 57488380 56389972 - R 17:13:03 00:02:47 postgres: BIND postgres 211277 66478 19 7.8 9.6 52294700 51127480 - R 17:13:03 00:02:31 postgres: BIND postgres 222749 66478 19 22.7 9.3 51320000 49073368 - R 17:35:33 00:02:09 postgres: BIND postgres 39513 66478 19 2.9 6.8 38651084 36354736 ep_poll S 16:13:03 00:02:43 postgres: idle Using PPID to identify high-memory backend processes. Let’s examine process 211276:

Case Study: Analyzing Occasional Slow INSERT VALUES

·2017 words·10 mins
The business team reported that INSERT VALUES occasionally became slow. By the time I checked the active sessions, the slow write problem had already subsided. Later, I discovered that the slow write problem lasted less than half a minute, with INSERT VALUES taking 1-2 seconds. I wrote a script to capture active session information and managed to get the session data: wait_event | count ---------------------+------- [null] | 11 WALRead | 1 DataFileRead | 2 BgWriterMain | 1 WALWrite | 40 AutoVacuumMain | 1 ClientRead | 385 LogicalLauncherMain | 1 The most abnormal wait event was WALWrite with 40 sessions.