[{"content":"AI rate 5%\nTL;DR # On May 26, the Xinchuang Database List 2026 No. 2 was released, with 23 products passing (8 centralized + 15 distributed) — the most ever. Most notably: Ping An, UnionPay, China Mobile, and China Telecom — four major buyers — had their self-incubated databases debut on the list. The Xinchuang logic has changed — buyers are no longer just buyers.\nThe Latest List # Historical batch statistics for the Xinchuang database list. Data source: China Information Security Evaluation Center (itsec.gov.cn), 8 batches total, 4 containing databases.\nBy Batch\nBatch Date Database Products Achieved Level II 2023#1 2023-12-26 11 (centralized) None 2024#2 2024-09-30 17 (6 centralized + 11 distributed) GaussDB 2025#2 2025-08-22 3 (centralized) None 2026#2 2026-05-26 23 (8 centralized + 15 distributed) Dameng/Yashan/GaussDB/GoldenDB By Appearances (≥2 times)\nVendor Count Dameng 3 GBASE 3 Alibaba Cloud 3 HighGo 2 Tencent Cloud 2 East Golden 2 Vastdata 2 Huawei Cloud 2 ZTE (GoldenDB) 2 OceanBase 2 Kingbase 2 Shentong 2 Xugu 2 Yashan 2 Only 1 time PingCAP/Wanli/Uxin/Ping An/China Mobile/UnionPay/Telecom Cloud/Timecho/Transwarp/DolphinDB/Z-Range/CM Suzhou By Category: Big Tech / Unicorn / Major Buyer\nCategory Vendors Big Tech Huawei Cloud (GaussDB/TaurusDB/DWS), Alibaba Cloud (PolarDB/AnalyticDB), Tencent Cloud (TDSQL), ZTE (GoldenDB), OceanBase (Ant Group) Unicorns PingCAP (TiDB), Yashan (SICS), Transwarp (ArgoDB), Timecho (TimechoDB), DolphinDB Major Buyers Ping An Tech (RASESQL), China UnionPay (UPDRDB), China Mobile (Panwei + He3DB), China Telecom Cloud (TeleDB) Traditional Xinchuang Dameng, Kingbase, GBASE, Shentong, HighGo, Xugu, Vastdata, East Golden, Wanli, Uxin The Floodgates Open # When this list came out, my reaction was four words: the floodgates opened. 23 products — the most ever. A few highlights:\nPing An RASESQL. The most unexpected. 
Ping An Group\u0026rsquo;s fintech capabilities have always been strong, but there was almost no public information about them building a database. Seeing \u0026ldquo;RASESQL\u0026rdquo; on the list stunned me for several seconds. A financial buyer of Ping An\u0026rsquo;s scale — once their self-developed database passes national testing, their internal Xinchuang replacement roadmap gains one more path.\nUnionPay UPDRDB. Equally mysterious. I had no idea UnionPay was building a distributed database before this. UnionPay\u0026rsquo;s transaction volume speaks for itself — a distributed database that can handle their own business won\u0026rsquo;t be technically weak.\nAlibaba Cloud PolarDB for MySQL. The MySQL-compatible edition of PolarDB not passing had been something many people remembered. Now, all three of PolarDB\u0026rsquo;s main lines — PG edition, distributed edition, MySQL edition — have passed. Add AnalyticDB, and Alibaba Cloud\u0026rsquo;s database family is basically complete.\nChina Mobile Panwei + China Telecom TeleDB. China Mobile already had He3DB (CM Suzhou) pass national testing last year; this year Panwei is their second product. China Telecom TeleDB debuts. Both telecom operators now have their own incubated Xinchuang databases, which should significantly reduce their respective Xinchuang replacement pressure. Interestingly, China Unicom has been silent — their Xinchuang strategy is clearly different from Mobile and Telecom.\nTranswarp ArgoDB. Transwarp started in the big data/Hadoop ecosystem and now their distributed database has passed national testing. 
Once crowned \u0026ldquo;China\u0026rsquo;s First Domestic Big Data Infrastructure Software Stock\u0026rdquo; with a market cap exceeding 30 billion, their path from data lake to Xinchuang database has been validated.\nImpact # The most important signal from this floodgate opening: buyers can self-develop databases.\nWhat are the implications?\nMajor buyers who succeed at self-development don\u0026rsquo;t have to be lambs to the slaughter. Those major buyers who haven\u0026rsquo;t built one yet may restart their self-development efforts. The market share that big tech and unicorns could compete for in the domestic database market just shrank. Financial industry players UnionPay and Ping An, telecom players China Mobile and China Telecom — all passed national testing, effectively earning a \u0026ldquo;R\u0026amp;D Success\u0026rdquo; gold badge. Internally, each organization must be celebrating. For external vendors, what they\u0026rsquo;ve lost isn\u0026rsquo;t just major clients — more precisely, they\u0026rsquo;ve lost absolute bargaining power.\n\u0026ldquo;I know you\u0026rsquo;re in a tough spot, and I know you can\u0026rsquo;t afford not to buy, so I\u0026rsquo;ll swap the butcher\u0026rsquo;s knife for a dragon-slaying blade and slaughter you to death\u0026rdquo; — for buyers who successfully incubated their own databases, this kind of predicament has been substantially eased. That\u0026rsquo;s significant.\nAs for where Xinchuang policy goes next, nobody can say. Based on previous lists, things should be getting stricter (last time only 3 databases passed), but this time they unexpectedly opened the floodgates. A sharp contraction next round isn\u0026rsquo;t impossible. 
Not just China Unicom — insurance industry players like CPIC and PICC, and even capable financial institutions, could consider jumping in to hand-roll their own database.\nBittersweet Reflections # Since our kernel team sits right behind me, I have some understanding of the Xinchuang R\u0026amp;D process. After consecutive failed submissions, the entire team\u0026rsquo;s morale was extremely low. I believe we weren\u0026rsquo;t the only ones — many teams whose submissions failed felt the same. For industries like finance and telecom, there\u0026rsquo;s a Xinchuang mandate, but if your self-developed product doesn\u0026rsquo;t pass approval, there\u0026rsquo;s no choice at the corporate strategy level, and at the team level, there\u0026rsquo;s no reason for existence. That\u0026rsquo;s why \u0026ldquo;passing national testing\u0026rdquo; carries such weight and influence. Thankfully they passed — heartfelt congratulations to them! RaseSQL No.1!\nAt the same time, it\u0026rsquo;s clear that Xinchuang results and direction are unstable, volatile, and impactful. It determines some companies\u0026rsquo; strategies and many people\u0026rsquo;s fates. I myself am even a piece on this wheel of fortune.\nBeyond those on the list, many organizations poured enormous effort but remain off the list. Their products might be terrible, or they might be excellent. But national testing is that stark watershed — a mysterious ticket of admission. Pass or fail — in the domestic market, those are two entirely different concepts.\nOK, just some thoughts — might delete later.\nReference # https://www.itsec.gov.cn/aqkkcp/cpgg/\nOriginal link: https://lastdba.com/2026/05/29/xinchuang-db-2026-review/\n","date":"May 29, 2026","externalUrl":null,"permalink":"/en/2026/05/29/a-dbas-perspective-on-the-0526-approved-database-list/","section":"Posts","summary":"AI rate 5%\nTL;DR # On May 26, the Xinchuang Database List 2026 No. 
2 was released, with 23 products passing (8 centralized + 15 distributed) — the most ever. Most notably: Ping An, UnionPay, China Mobile, and China Telecom — four major buyers — had their self-incubated databases debut on the list. The Xinchuang logic has changed — buyers are no longer just buyers.\nThe Latest List # Historical batch statistics for the Xinchuang database list. Data source: China Information Security Evaluation Center (itsec.gov.cn), 8 batches total, 4 containing databases.\n","title":"A DBA's Perspective on the 0526 Approved Database List","type":"posts"},{"content":"","date":"May 29, 2026","externalUrl":null,"permalink":"/en/tags/approved-list/","section":"Tags","summary":"","title":"Approved List","type":"tags"},{"content":"","date":"May 29, 2026","externalUrl":null,"permalink":"/en/categories/","section":"Categories","summary":"","title":"Categories","type":"categories"},{"content":"","date":"May 29, 2026","externalUrl":null,"permalink":"/en/tags/domestic-databases/","section":"Tags","summary":"","title":"Domestic Databases","type":"tags"},{"content":"","date":"May 29, 2026","externalUrl":null,"permalink":"/en/tags/postgresql/","section":"Tags","summary":"","title":"PostgreSQL","type":"tags"},{"content":"","date":"May 29, 2026","externalUrl":null,"permalink":"/en/categories/postgresql%E5%86%85%E5%8A%9F%E4%BF%AE%E7%82%BC/","section":"Categories","summary":"","title":"PostgreSQL内功修炼","type":"categories"},{"content":"","date":"May 29, 2026","externalUrl":null,"permalink":"/en/posts/","section":"Posts","summary":"","title":"Posts","type":"posts"},{"content":"","date":"May 29, 2026","externalUrl":null,"permalink":"/en/tags/","section":"Tags","summary":"","title":"Tags","type":"tags"},{"content":"PostgreSQL DBA. Writing about database internals, production cases, and source code analysis.\n80 articles · 8 categories · updating ","date":"May 29, 2026","externalUrl":null,"permalink":"/en/","section":"The Last DBA","summary":"PostgreSQL DBA. 
Writing about database internals, production cases, and source code analysis.\n80 articles · 8 categories · updating ","title":"The Last DBA","type":"page"},{"content":"","date":"May 29, 2026","externalUrl":null,"permalink":"/en/tags/uuid/","section":"Tags","summary":"","title":"UUID","type":"tags"},{"content":"Source material: HN UUID v4 Collision Thread, dev.to UUID Benchmark\nAI-generated ratio: 99%\nTL;DR # UUID v4 collided — someone on HackerNews actually hit a real collision. The root cause was a software stack bug, not math. v4 and v7 have no fundamental difference in collision safety. The real difference is index performance: v7 is time-ordered, B-tree is more compact, writes are 35% faster, indexes are 22% smaller. Your UUID v4 is probably fine, but if you care about index performance, switching to v7 is a cheap win.\nThe UUID v4 Collision Incident # A HackerNews thread blew up — Ask HN: We just had an actual UUID v4 collision\u0026hellip;, 479 upvotes, 347 comments.\nThe OP\u0026rsquo;s own words:\nI know what you\u0026rsquo;re thinking\u0026hellip; and I still can\u0026rsquo;t believe it, but\u0026hellip; This morning, our database flagged a duplicate UUID (v4).\nIt wasn\u0026rsquo;t a double-insert bug. The code didn\u0026rsquo;t write it twice. Only ~15,000 rows in the table, using npm\u0026rsquo;s uuid package uuidv4(), and two rows created at different times collided on the same UUID:\nb6133fd6-70fe-4fe3-bed6-8ca8fc9386cd What\u0026rsquo;s the probability of a UUID v4 collision? 122 random bits, 2^122 ≈ 5.3×10^36 possibilities. With 15,000 records, collision probability is roughly 2×10^-29. Theoretically \u0026ldquo;impossible.\u0026rdquo;\nBut it happened.\nCause 1: Unreliable entropy sources # HN\u0026rsquo;s top-voted comment (jandrewrogers):\nUUIDv4 security depends on high-quality entropy sources. Hardware defects, software bugs, and misunderstandings of \u0026ldquo;high-quality entropy\u0026rdquo; all break this assumption. 
Detecting entropy source failures is expensive, so nobody checks — until a collision happens.\nUUID v4 is explicitly banned in high-reliability systems because entropy source quality cannot be verified.\nCause 2: Known npm uuid package bugs # The npm uuid package README itself warns:\nThis module may generate duplicate UUIDs when run in clients with deterministic random number generators, such as Googlebot crawlers.\nMore seriously, its internal rng() function has global mutable state. One commenter pointed out: calling rng() and sending the result effectively overwrites someone else\u0026rsquo;s random number, and you can predict it.\nRelated commit: 91805f665c\nCommunity advice: use Node.js built-in crypto.randomUUID(), not the npm uuid package.\nCause 3: Linux kernel /dev/random race condition # Another comment:\nI encountered duplicate UUIDs during soak testing of a distributed system. After extensive debugging, I found it was a Linux kernel race condition bug — on multi-processor systems, two processes simultaneously reading /dev/random could, with extremely low probability (~one in a million), get the same bytes.\nCause 4: Go UUID library not checking return values # Early Go UUID libraries called random functions without checking the return value length. \u0026ldquo;Request N bytes, got 3 bytes back\u0026rdquo; never happened on most hardware, so nobody checked — until production, where it generated thousands of duplicate UUIDs.\nCause 5: Historical AMD CPU RNG defects # Certain AMD CPUs had built-in random number generator issues. VM environments can also \u0026ldquo;virtualize away\u0026rdquo; entropy — both time sources and entropy sources can degrade inside VMs.\nv4 and v7 have no fundamental difference in collision safety. The difference is in the first 48 bits — v4 is random, v7 is a timestamp. You\u0026rsquo;re unlikely to encounter timestamp source issues, and random source issues are equally rare. The HN thread is an interesting edge case. 
Knowing that a tiny number of people hit it is enough — you don\u0026rsquo;t need to distrust the UUID v4 in your own systems.\nWhen choosing v4 vs v7, what you should really look at isn\u0026rsquo;t collisions — it\u0026rsquo;s index performance.\nUUID v7 Performance Comparison in PG 16 # UUID v7 has one concrete advantage over v4 in PostgreSQL: temporal clustering, more B-tree-friendly. v4 can bloat and v7 can bloat too — the difference is simply that v7\u0026rsquo;s first 48 bits are time-ordered, so inserts concentrate on the right side of the B-tree, reducing page splits.\nUmang Sinha\u0026rsquo;s benchmark ran a rigorous comparison on a PG 16 Docker container (8 cores, 16GB, NVMe).\nTest Conditions # CREATE TABLE uuid_v4_test (id UUID PRIMARY KEY, payload TEXT); CREATE TABLE uuid_v7_test (id UUID PRIMARY KEY, payload TEXT); Parameter Value Data volume 10 million rows per table Batch size 10,000 rows per batch Client Go + pq driver UUID generation Pre-generated in memory, not timed Performance Results # Metric UUID v4 UUID v7 Improvement Write 10M rows 5 min 35 sec 3 min 38 sec 35% faster Table + index total size 3618 MB 3443 MB 5% smaller B-tree index size 776 MB 602 MB 22% smaller Point lookup 0.167 ms 0.038 ms 4.4x faster Range scan 8.283 ms 3.791 ms 2.2x faster Why Such a Big Difference # UUID v4 is fully random. Newly inserted UUIDs scatter randomly across the B-tree index, causing massive page splits and severe index fragmentation. UUID v7 has a millisecond-precision timestamp in the first 48 bits, so newly generated UUIDs are naturally ordered — writes cluster on the right side of the B-tree, page splits drop dramatically, and the index is much more compact.\nThe 22% smaller index isn\u0026rsquo;t magic — it\u0026rsquo;s reduced fragmentation. 
Point lookups being 4x faster isn\u0026rsquo;t surprising either: fewer B-tree levels, higher cache hit rates.\nSummary # UUID v4 and v7 are identical in collision safety: both depend on entropy source quality; one fills the first 48 bits with random numbers, the other with a timestamp. Collisions are edge cases that a tiny number of people hit in specific environments. Your environment is probably fine, and that basic judgment doesn\u0026rsquo;t change.\nWhat you really should think about is index performance. v7\u0026rsquo;s temporal property makes B-trees more compact, with measured results of 35% faster writes, 22% smaller indexes, and 2-4x faster queries. If your system writes UUIDs at high volume, switching to v7 saves meaningful storage and CPU.\nPG 18 natively supports UUID v7 through the built-in uuidv7() function. On earlier versions such as PG 16, generate UUIDs at the application layer. Whichever version you use, always add a UNIQUE constraint.\nThis article was originally published in Chinese on lastdba.com.\n","date":"May 29, 2026","externalUrl":null,"permalink":"/en/2026/05/29/uuid-v4-and-v7-collision-incidents-and-performance-benchmarks/","section":"Posts","summary":"Source material: HN UUID v4 Collision Thread, dev.to UUID Benchmark\nAI-generated ratio: 99%\nTL;DR # UUID v4 collided: someone on HackerNews actually hit a real collision. The root cause was a software stack bug, not math. v4 and v7 have no fundamental difference in collision safety. The real difference is index performance: v7 is time-ordered, B-tree is more compact, writes are 35% faster, indexes are 22% smaller. 
Your UUID v4 is probably fine, but if you care about index performance, switching to v7 is a cheap win.\n","title":"UUID v4 and v7: Collision Incidents and Performance Benchmarks","type":"posts"},{"content":"","date":"May 29, 2026","externalUrl":null,"permalink":"/en/tags/xinchuang/","section":"Tags","summary":"","title":"Xinchuang","type":"tags"},{"content":"","date":"2026-05-29","externalUrl":null,"permalink":"/tags/%E4%BF%A1%E5%88%9B/","section":"Tags","summary":"","title":"信创","type":"tags"},{"content":"","date":"2026-05-29","externalUrl":null,"permalink":"/tags/%E5%9B%BD%E4%BA%A7%E6%95%B0%E6%8D%AE%E5%BA%93/","section":"Tags","summary":"","title":"国产数据库","type":"tags"},{"content":"","date":"2026-05-29","externalUrl":null,"permalink":"/tags/%E5%AE%89%E5%8F%AF/","section":"Tags","summary":"","title":"安可","type":"tags"},{"content":"","date":"May 29, 2026","externalUrl":null,"permalink":"/en/tags/%E6%80%A7%E8%83%BD/","section":"Tags","summary":"","title":"性能","type":"tags"},{"content":"","date":"May 29, 2026","externalUrl":null,"permalink":"/en/categories/%E6%9D%82%E9%A1%B9/","section":"Categories","summary":"","title":"杂项","type":"categories"},{"content":"","date":"May 29, 2026","externalUrl":null,"permalink":"/en/tags/%E7%B4%A2%E5%BC%95/","section":"Tags","summary":"","title":"索引","type":"tags"},{"content":"","date":"May 27, 2026","externalUrl":null,"permalink":"/en/tags/agent/","section":"Tags","summary":"","title":"Agent","type":"tags"},{"content":"","date":"May 27, 2026","externalUrl":null,"permalink":"/en/tags/ai/","section":"Tags","summary":"","title":"AI","type":"tags"},{"content":"","date":"May 27, 2026","externalUrl":null,"permalink":"/en/categories/aiops/","section":"Categories","summary":"","title":"AIOps","type":"categories"},{"content":"","date":"May 27, 2026","externalUrl":null,"permalink":"/en/tags/mcp/","section":"Tags","summary":"","title":"MCP","type":"tags"},{"content":"Original: Building an MCP Server Using Postgres, Bruce Momjian, PGDay 
Armenia 2026, CC BY 4.0.\nAI-generated ratio: 80%\nBruce Momjian (PG core team, the one who has written release notes for 20+ years) recently gave a talk at PGDay Armenia 2026: Building an MCP Server Using Postgres. 70 slides, extremely dense. Theory and practice — a solid reference.\nReading it directly is hard work. Even having AI interpret it probably won\u0026rsquo;t make sense at first glance. I had to read for a while and ask several questions before it clicked.\nThese 70 slides can be cleanly split into two layers — the first half is theory, the second half is a hands-on demo. The two layers don\u0026rsquo;t have much to do with each other.\nTheory Layer: Explaining the RAG → MCP Evolution Through Transformers (Slides 1-33) # The theory layer takes up nearly half the content, from LLM fundamentals to how MCP works. The outline is clear:\nRAG vs MCP: In One Sentence # Everyone knows the RAG workflow: the programmer decides what data to query → retrieval results are appended to the system prompt → the LLM reads and generates a response. Pre-orchestrated — what the LLM can see is decided before the user even asks.\nMCP is different. Tool descriptions are registered with the LLM, and the LLM decides for itself during generation whether to call a tool and which one. Dynamic decision-making — the programmer only exposes tools, the LLM handles orchestration.\nBruce sums it up in one sentence:\nRAG can only do what the programmer pre-planned. MCP can dynamically adjust based on output quality, can iteratively call multiple tools, and can trigger external tasks.\n\u0026ldquo;Word or MCP\u0026rdquo; — That Set of Vector Embedding Diagrams # Slides 18-33 are the core of the theory layer. 
Bruce draws a detailed internal Transformer flow diagram:\nHis logic: take each MCP tool\u0026rsquo;s description text (e.g., \u0026ldquo;Return the radiation level (CPM) at 13 Roberts Road\u0026hellip;\u0026rdquo;), embed it into a vector using a text embedding model, and inject it into the attention layer\u0026rsquo;s vector space. Then at each inference step, the output vector matches against the nearest vector —\n\u0026ldquo;The closest vector might be a word or an MCP.\u0026rdquo;\nIs This Model Correct? # This is what puzzled me the most. Here are my thoughts.\nBruce\u0026rsquo;s 15 slides are beautifully drawn, but if you try to understand them as engineering implementation, there are problems:\n① MCP tools don\u0026rsquo;t need \u0026ldquo;embedding.\u0026rdquo; In actual engineering, tool definitions are written directly into the system prompt as text. The LLM reads \u0026ldquo;You have these tools: geiger(), get_pretzel_inventory()…\u0026rdquo; and uses semantic understanding to decide when to call them. There\u0026rsquo;s no need to compute tool descriptions as vectors, no need to do cosine distance comparisons against word vectors. The essence of Bruce\u0026rsquo;s teaching model is explaining \u0026ldquo;LLM decision-making\u0026rdquo; as \u0026ldquo;nearest vector matching\u0026rdquo; — this is closer to the retrieval paradigm than the generation paradigm.\n② Attention doesn\u0026rsquo;t produce a \u0026ldquo;find nearest\u0026rdquo; operation. output = Σ(softmax(Q·K) × V) yields a weighted-mixed context vector. There\u0026rsquo;s no step of \u0026ldquo;binary choice between the word embedding table and the tool embedding table.\u0026rdquo; The actual mechanism for LLM tool selection is: attention produces hidden states → LM head → softmax over vocabulary → output tool call JSON. 
There\u0026rsquo;s never a \u0026ldquo;word vs tool\u0026rdquo; choice, only a softmax over the entire vocabulary.\n③ System prompt and user prompt have no boundary in attention. A token sequence is just a token sequence — attention blocks do Q·K dot products on all tokens equally. There is no \u0026ldquo;system zone\u0026rdquo; or \u0026ldquo;user zone.\u0026rdquo;\nSo these 33 theory slides can be seen as a simplified teaching model Bruce built for DBAs without an AI background — visually appealing and easy to understand, but don\u0026rsquo;t use it as an architecture diagram. MCP\u0026rsquo;s truly revolutionary aspect is protocol standardization (unified tool registration/discovery/calling spec), not any vectorization trick.\nPractice Layer: Two Working Demos (Slides 34-69) # Starting from Slide 34, the style abruptly shifts — all code, terminal output, hardware photos. That entire Transformer vector model from the theory layer completely disappears, replaced by curl, psql, and Perl scripts.\nThe only thread connecting the two layers is that \u0026ldquo;they\u0026rsquo;re both talking about MCP.\u0026rdquo; But the vector matching mechanism painted in the theory layer and the actual implementation in the practice layer are nearly two different logic systems. This may be exactly the tension Bruce intended — the theory layer helps you understand why MCP is stronger than RAG, and the practice layer tells you how to actually implement it today.\nDemo 1: Letting ChatGPT Read a Real-World Geiger Counter # Bruce set up a GQ GMC-800 Geiger counter (radiation detector) in his backyard, connected via USB to a Raspberry Pi, taking environmental radiation readings every 15 minutes. 
First, see ChatGPT using MCP to call real data:\n\nMCP can call external tools to get real-time data, something RAG cannot do.\n\nConnected to hardware:\n\nWrote a Python wrapper using fastmcp:\n\nimport subprocess from fastmcp import FastMCP mcp = FastMCP(\u0026#34;Geiger counter MCP server\u0026#34;) @mcp.tool def geiger() -\u0026gt; int: \u0026#34;\u0026#34;\u0026#34;Return the radiation level (CPM) at 13 Roberts Road, Newtown Square, PA, USA\u0026#34;\u0026#34;\u0026#34; return int(subprocess.check_output( \u0026#34;/var/lib/postgresql/tmp/geiger\u0026#34;, shell=True, text=True )) The underlying layer is a Perl script that sends \u0026lt;GETCPM\u0026gt;\u0026gt; over serial and reads back a 4-byte CPM value. Apache reverse-proxies port 443 (OpenAI only talks to 443). After registering with ChatGPT:\nUser: What\u0026#39;s the radiation level at 13 Roberts Road? GPT: I don\u0026#39;t have public data for that location... User: Use my custom app GPT: [calls geiger tool] → 14 CPM. Normal background radiation (5-25 CPM). User: Take five readings and give me the average GPT: [calls ×5] 15 16 13 15 15 → average 14.8 CPM Two key behaviors:\nThe LLM can iteratively call tools and compute: RAG is a one-shot data dump, MCP is \u0026ldquo;call → get result → decide → call again → compute\u0026rdquo; The user must explicitly authorize: the first time, ChatGPT didn\u0026rsquo;t say \u0026ldquo;I have your Geiger counter data.\u0026rdquo; Only when the user said \u0026ldquo;use my custom app\u0026rdquo; did the tool call trigger. The security model is conservative Demo 2: Using PG as a Pretzel Shop Inventory System # From hardware back to software. 
Building a pretzel inventory database:\n\nCREATE TABLE pretzel ( quantity INTEGER CHECK (quantity \u0026gt;= 0) ); INSERT INTO pretzel VALUES (0); -- initial inventory 0\nMCP tools use psql to operate on PG directly:\n@mcp.tool def get_pretzel_inventory() -\u0026gt; int: \u0026#34;\u0026#34;\u0026#34;Return the number of unsold pretzels\u0026#34;\u0026#34;\u0026#34; return int(subprocess.check_output( \u0026#34;psql --tuples-only -c \u0026#39;SELECT quantity FROM pretzel;\u0026#39; -d mcp\u0026#34;, shell=True, text=True )) @mcp.tool def sold_one_pretzel() -\u0026gt; str: \u0026#34;\u0026#34;\u0026#34;Call this when a pretzel is sold; reduces inventory by one\u0026#34;\u0026#34;\u0026#34; return subprocess.check_output( \u0026#34;psql --tuples-only -c \u0026#39;UPDATE pretzel SET quantity = quantity - 1;\u0026#39; -d mcp\u0026#34;, shell=True, text=True ) @mcp.tool def baked_6_pretzels() -\u0026gt; str: \u0026#34;\u0026#34;\u0026#34;Call this when a tray of 6 pretzels is baked; increases inventory\u0026#34;\u0026#34;\u0026#34; return subprocess.check_output( \u0026#34;psql --tuples-only -c \u0026#39;UPDATE pretzel SET quantity = quantity + 6;\u0026#39; -d mcp\u0026#34;, shell=True, text=True ) Interaction flow:\nUser: How many pretzels available? GPT: 0 pretzels. User: I just baked a tray → 6 pretzels User: I sold two → 4 remaining User: I sold four → 0 remaining User: I sold one pretzel → ERROR! CHECK constraint prevented negative quantity The LLM doesn\u0026rsquo;t write SQL directly; it calls your predefined, controlled interfaces. 
PG\u0026rsquo;s CHECK constraints naturally form a safety net — even if the LLM is tricked into calling the wrong function, the database-level constraint provides a second line of defense.\nBut this also exposes a problem: the LLM faithfully executed sold_one_pretzel, but didn\u0026rsquo;t anticipate that \u0026ldquo;inventory is 0, calling it will error.\u0026rdquo; MCP is the execution layer, not the reasoning layer.\nHow Far from Production # On the final slide, Bruce frankly admits the current implementation\u0026rsquo;s limitations:\nNo authentication — anyone can call your MCP Server No parameterization — all three tools are parameterless functions; real-world tools need to accept parameters No security restrictions on dynamic SQL — tool descriptions declare semantics, but the LLM could be injected with malicious content Connection pooling, transaction management, rate limiting — none addressed Two recommended practical reads:\npgedge.com: Lessons Learned Writing an MCP Server for PostgreSQL CardinalOps: MCP Defaults — Hidden Dangers of Remote Deployment Between the Two Layers # Looking back at these 70 slides, the most interesting part isn\u0026rsquo;t any single demo — it\u0026rsquo;s how the theoretical thinking and hands-on work together explain what MCP can do:\nThe theory layer uses Transformer vector spaces to explain \u0026ldquo;how the LLM chooses between words and tools\u0026rdquo; — this is a teaching model The practice layer uses psql, curl, and Perl scripts to actually implement things — this is engineering The real MCP mechanism — tool definitions inserted as text into the system prompt, the LLM using semantic understanding to decide which tool to call, outputting tool call JSON — needs none of the vector embedding model from the theory layer. Between the two layers, Bruce didn\u0026rsquo;t draw the connecting line. 
This might not be a bug — it might be a feature.\nThis article was originally published in Chinese on lastdba.com.\n","date":"May 27, 2026","externalUrl":null,"permalink":"/en/2026/05/27/when-postgresql-becomes-ais-hands-bruce-momjians-mcp-server-in-practice/","section":"Posts","summary":"Original: Building an MCP Server Using Postgres, Bruce Momjian, PGDay Armenia 2026, CC BY 4.0.\nAI-generated ratio: 80%\nBruce Momjian (PG core team, the one who has written release notes for 20+ years) recently gave a talk at PGDay Armenia 2026: Building an MCP Server Using Postgres. 70 slides, extremely dense. Theory and practice — a solid reference.\nReading it directly is hard work. Even having AI interpret it probably won’t make sense at first glance. I had to read for a while and ask several questions before it clicked.\n","title":"When PostgreSQL Becomes AI's Hands — Bruce Momjian's MCP Server in Practice","type":"posts"},{"content":" It\u0026rsquo;s Live! # The blog is finally live.\nURL: https://lastdba.com\nAccessible from China, mobile-friendly too.\n76 articles — all PostgreSQL writing from the past few years: case studies, internals, source code analysis, paper deep reads.\nThis is a proper launch: new framework, new domain, new theme — rebuilt from the ground up.\nHighlights # Clean Interface\nMinimalist, reader-friendly design with a useful search feature.\nFramework: Jekyll → Hugo\nVersion 1: Jekyll + minima theme + 2000 lines of CSS\nVersion 2: Hugo + Blowfish theme + 0 lines of CSS\nV1 was decent, but building the UI myself was exhausting. I remembered vonng had written an article about website architecture choices, so I just went and borrowed from it. I explained the architecture to AI and had it learn from vonng.com — the page quality jumped up a level instantly. A few more tweaks and it was done.\nDomain: github.io → lastdba.com\nBought lastdba.com, configured Cloudflare. GitHub Pages with custom domain, free HTTPS certificate, auto-renewal. 
Now accessible without VPN!\nImage Localization\nPreviously, article images were scattered everywhere — CSDN CDN, GitHub PicBed, Modb OSS. CSDN has hotlink protection. GitHub PicBed on foreign networks often failed to load domestically. This time I had AI consolidate everything to local paths. No more worrying about image hosts going down. Cross-network image loading problems solved — very good.\nReflections on Going Live # I\u0026rsquo;d actually set up a blog URL before — just fork a blog project and deploy via GitHub Pages. The domain was liuzhilong62.github.io/blogs. But being somewhat of a quality freak (not really), the results were mediocre so I took it down. Later I just used the GitHub repo as my blog, without even enabling Pages. Recently, with more free time for various reasons, I revisited this and used Hermes to build the blog from scratch.\nAs a DBA and backend engineer, I know nothing about frontend stuff like Jekyll, Hugo, Blowfish, CSS. I just give Hermes a target and it does the work. When it explains things to me I don\u0026rsquo;t understand (and I\u0026rsquo;m too embarrassed to admit it), I basically just say \u0026ldquo;keep going.\u0026rdquo; I check the result in the browser — if I\u0026rsquo;m satisfied, great; occasionally I say \u0026ldquo;revert this.\u0026rdquo;\nHonestly, my biggest takeaway from switching to Hugo wasn\u0026rsquo;t technical — it was \u0026ldquo;don\u0026rsquo;t reinvent the wheel.\u0026rdquo; I\u0026rsquo;d spent so much time hand-coding dark mode, TOC, search, only to discover a theme swap includes it all, and theirs looks better than mine.\nAlso, after hooking up lastdba.com, the blog suddenly felt \u0026ldquo;official.\u0026rdquo; liuzhilong62.github.io/blogs felt like a personal experiment; now it feels like a real website. 
Same content, different feeling.\nWhat It Cost # All expenses:\nItem Cost lastdba.com domain (Cloudflare, 1 year) ¥70 GitHub Pages hosting ¥0 Hugo framework ¥0 Blowfish theme ¥0 Cloudflare DNS + CDN ¥0 Tokens ¥60 Total ¥130 Possibly the most cost-effective personal website solution out there.\nFinally # Some details may not be polished — feedback, bug reports, and optimization suggestions welcome.\nI\u0026rsquo;ll likely keep updating.\nReference # https://vonng.com/\nOriginal link: https://lastdba.com/2026/05/16/个人博客上线/\n","date":"May 16, 2026","externalUrl":null,"permalink":"/en/2026/05/16/my-blog-is-live/","section":"Posts","summary":"It’s Live! # The blog is finally live.\nURL: https://lastdba.com\nAccessible from China, mobile-friendly too.\n76 articles — all PostgreSQL writing from the past few years: case studies, internals, source code analysis, paper deep reads.\nThis is a proper launch: new framework, new domain, new theme — rebuilt from the ground up.\nHighlights # Clean Interface\nMinimalist, reader-friendly design with a useful search feature.\n","title":"My Blog is Live","type":"posts"},{"content":" Problem Symptoms # The database instance\u0026rsquo;s RSS memory was maxed out, OOM messages appeared in the logs, and the instance died. We won\u0026rsquo;t analyze the OOM cause here.\nBut startup kept failing — 4 or 5 attempts according to the logs:\n2026-02-12 09:15:21 CST::@:[578272]: FATAL: pre-existing shared memory block (key 2048, ID 1328250881) is still in use 2026-02-12 09:15:21 CST::@:[578272]: HINT: Terminate any old server processes associated with data directory \u0026#34;/data\u0026#34;. 2026-02-12 09:15:21 CST::@:[578272]: LOG: database system is shut down 2026-02-12 09:21:03 CST::@:[658824]: FATAL: pre-existing shared memory block (key 2048, ID 1328250881) is still in use 2026-02-12 09:21:03 CST::@:[658824]: HINT: Terminate any old server processes associated with data directory \u0026#34;/data\u0026#34;. 
2026-02-12 09:21:03 CST::@:[658824]: LOG: database system is shut down 2026-02-12 09:31:12 CST::@:[794791]: LOG: redirecting log output to logging collector process 2026-02-12 09:31:12 CST::@:[794791]: HINT: Future log output will appear in directory \u0026#34;/data/pg_log\u0026#34;. 2026-02-12 09:31:37 CST::@:[801049]: FATAL: lock file \u0026#34;postmaster.pid\u0026#34; already exists 2026-02-12 09:31:37 CST::@:[801049]: HINT: Is another postmaster (PID 794791) running in data directory \u0026#34;/data\u0026#34;? 2026-02-12 09:32:34 CST::@:[814396]: FATAL: lock file \u0026#34;postmaster.pid\u0026#34; already exists 2026-02-12 09:32:34 CST::@:[814396]: HINT: Is another postmaster (PID 794791) running in data directory \u0026#34;/data\u0026#34;? Startup succeeded after the DBA ran ipcrm -m xxx before starting.\nAlthough the issue was quickly resolved, many questions remained:\nWhy isn\u0026rsquo;t this scenario more common in practice? The start.log shows two different error types — what operations and logic do they correspond to? Can shared memory still exist even if the postmaster is gone? How do you locate and clean up this shared memory segment? PG has multiple shared memory segments — which one is this? Besides ipcrm -m, are there other ways to get the instance started? 
Error Analysis: pre-existing shared memory block # Three Types of Shared Memory # Normally, after PG starts, there are three shared memory segments.\nUsing the default shared_memory_type='mmap' without huge pages as an example:\n## View PG\u0026#39;s actual shared memory usage from its virtual memory map cat /proc/`head -1 $PGDATA/postmaster.pid`/smaps | grep -E \u0026#34;\\-s\u0026#34; 2b61b0563000-2b61b0564000 rw-s 00000000 00:04 116293664 /SYSV00001000 (deleted) 2b61b057f000-2b61b05b3000 rw-s 00000000 00:12 1501001168 /dev/shm/PostgreSQL.1193490778 2b61bbac2000-2b61fa67a000 rw-s 00000000 00:04 1500999610 /dev/zero (deleted) From top to bottom, these are: the SysV shared memory used at startup, shared memory for parallel queries, and shared memory for shared_buffers.\nIf shared_buffers uses huge pages, or if the shared_memory_type is SysV instead of mmap, the output differs slightly.\nHuge pages:\n2aaaaac00000-2aba9ca00000 rw-s 00000000 00:0e 48453452 /anon_hugepage (deleted) 2b08f2eea000-2b08f2eeb000 rw-s 00000000 00:04 50692152 /SYSV00001000 (deleted) 2b08f2f05000-2b08f302d000 rw-s 00000000 00:12 48436142 /dev/shm/PostgreSQL.1345689218 shared_memory_type = \u0026lsquo;sysv\u0026rsquo;:\n2b03b3ceb000-2b03b3d1f000 rw-s 00000000 00:12 1572332304 /dev/shm/PostgreSQL.2883611352 2b03bf0c2000-2b03fdc7a000 rw-s 00000000 00:04 143917075 /SYSV00001000 (deleted) Summary:\nPG Shared Memory Config smaps Segments shared_buffers smaps sysv smaps shared_memory_type=mmap, no huge pages 3 segments /dev/zero /SYSV00001000 shared_memory_type=sysv, no huge pages 2 segments /SYSV00001000 /SYSV00001000 shared_memory_type=mmap, with huge pages 3 segments /anon_hugepage /SYSV00001000 shared_memory_type=sysv, with huge pages not supported not supported Now the key question: when the error says pre-existing shared memory block, which shared memory segment is it talking about?\nSource Code Analysis # Searching for the error message in the source quickly leads to the key location: 
src/backend/port/sysv_shmem.c\nFirst, understand what the SysV shmem is for. From scattered README content:\nWe still require a SysV shmem block to * exist, though, because mmap\u0026#39;d shmem provides no way to find out how * many processes are attached, which we need for interlocking purposes. * As of PostgreSQL 9.3, we normally allocate only a very small amount of * System V shared memory, and only for the purposes of providing an * interlock to protect the data directory. The real shared memory block * is allocated using mmap(). This works around the problem that many * systems have very low limits on the amount of System V shared memory * that can be allocated. Even a limit of a few megabytes will be enough * to run many copies of PostgreSQL without needing to adjust system settings. SysV shmem can determine whether shared memory is still attached; mmap cannot This SysV shmem is used to protect the data directory; shared_buffers uses mmap (by default), not SysV This SysV shmem segment is tiny (from the virtual addresses we can see it\u0026rsquo;s just 4K = 2b61b0563000-2b61b0564000) Now look at the shm state enum:\ntypedef enum { SHMSTATE_ANALYSIS_FAILURE,\t/* unexpected failure to analyze the ID */ SHMSTATE_ATTACHED,\t/* pertinent to DataDir, has attached PIDs */ SHMSTATE_ENOENT,\t/* no segment of that ID */ SHMSTATE_FOREIGN,\t/* exists, but not pertinent to DataDir */ SHMSTATE_UNATTACHED\t/* pertinent to DataDir, no attached PIDs */ } IpcMemoryState; The key states are ATTACHED, FOREIGN, and UNATTACHED.\nThe SysV shmem protects the data directory — the common scenario is ensuring the directory isn\u0026rsquo;t running two instances. Since it\u0026rsquo;s shared memory, weird scenarios could mean the segment doesn\u0026rsquo;t belong to this directory or this process (FOREIGN state). If the shared memory corresponds to the data directory but no processes are running, it should be UNATTACHED. 
With processes running, it\u0026rsquo;s ATTACHED.\nNow look at the error thrown by PGSharedMemoryCreate:\nPGShmemHeader * PGSharedMemoryCreate(Size size, PGShmemHeader **shim) {... for (;;) // infinite loop {.. shmid = shmget(NextShmemSegID, sizeof(PGShmemHeader), 0);// shmget to fetch the SysV shmem and return its shmid if (shmid \u0026lt; 0) { oldhdr = NULL; state = SHMSTATE_FOREIGN; } else state = PGSharedMemoryAttach(shmid, NULL, \u0026amp;oldhdr);// determine this shmem segment\u0026#39;s state switch (state)// take different actions based on the shared memory state { ...// only showing 2 states here: attached and unattached case SHMSTATE_ATTACHED: // shm is attached — throw the error (this is the fault symptom we saw) ereport(FATAL, (errcode(ERRCODE_LOCK_FILE_EXISTS), errmsg(\u0026#34;pre-existing shared memory block (key %lu, ID %lu) is still in use\u0026#34;, (unsigned long) NextShmemSegID, (unsigned long) shmid), errhint(\u0026#34;Terminate any old server processes associated with data directory \\\u0026#34;%s\\\u0026#34;.\u0026#34;, DataDir))); break; ... case SHMSTATE_UNATTACHED:// shm is unattached /* * The segment pertains to DataDir, and every process that had * used it has died or detached. Zap it, if possible, and any * associated dynamic shared memory segments, as well. This * shouldn\u0026#39;t fail, but if it does, assume the segment belongs * to someone else after all, and try the next candidate. * Otherwise, try again to create the segment. That may fail * if some other process creates the same shmem key before we * do, in which case we\u0026#39;ll try the next key. */ // The segment belongs to the data directory, and no process still holds it if (oldhdr-\u0026gt;dsm_control != 0) dsm_cleanup_using_control_segment(oldhdr-\u0026gt;dsm_control); if (shmctl(shmid, IPC_RMID, NULL) \u0026lt; 0) NextShmemSegID++; // Note: ShmemSegID increments and retries break; } ... } ... } When shmem is ATTACHED, it throws the error. 
When the segment is UNATTACHED, PG removes the stale segment and loops back to create a new one with the same key; it increments NextShmemSegID and moves on to the next key only if the removal fails.\nThe first case corresponds to this fault. The second case corresponds to normal crash recovery (the instance can still start after a crash). SysV shmem # From PG10 onwards, the postmaster.pid and SysV shmem logic was significantly reworked and has been largely stable since. This article only covers the PG10+ logic.\npidfile.h:\n#define LOCK_FILE_LINE_SHMEM_KEY\t7 sysv_shmem.c, InternalIpcMemoryCreate():\n{ char\tline[64]; sprintf(line, \u0026#34;%9lu %9lu\u0026#34;, (unsigned long) memKey, (unsigned long) shmid); AddToDataDirLockFile(LOCK_FILE_LINE_SHMEM_KEY, line); } From the source code, shmem info is saved on line 7 of postmaster.pid, containing the shmkey and shmid.\n\u0026gt; cat postmaster.pid 242712 /data 1772698474 8531 /tmp 0.0.0.0 4096 143917078 # \u0026lt;----here ready What Are shmkey and shmid? # In PG\u0026rsquo;s source, the call path is: InternalIpcMemoryCreate():\nshmid = shmget(memKey, 0, IPC_CREAT | IPC_EXCL | IPCProtection); PG uses shmkey/memkey as a seed key to request shared memory from the kernel, which returns a unique identifier, shmid.\nThe shmid depends heavily on the server, or more precisely on the server\u0026rsquo;s memory state. For PG, when an instance is restarted quickly, the shmid may be the same or +1; this depends on Linux kernel internals. After a full server reboot, it\u0026rsquo;ll be completely different.\nTo aid understanding: whether the server reboots or not, shmkey/memkey can remain constant (since it\u0026rsquo;s user/PG input). But across a server reboot, even with the same shmkey, the returned shmid is very unlikely to be the same value.\nHow PG Obtains the shmkey # PGSharedMemoryCreate():\n/* * We use the data directory\u0026#39;s ID info (inode and device numbers) to * positively identify shmem segments associated with this data dir, and * also as seeds for searching for a free shmem key. 
*/ if (stat(DataDir, \u0026amp;statbuf) \u0026lt; 0) ereport(FATAL, (errcode_for_file_access(), errmsg(\u0026#34;could not stat data directory \\\u0026#34;%s\\\u0026#34;: %m\u0026#34;, DataDir))); ... /* * Loop till we find a free IPC key. Trust CreateDataDirLockFile() to * ensure no more than one postmaster per data directory can enter this * loop simultaneously. (CreateDataDirLockFile() does not entirely ensure * that, but prefer fixing it over coping here.) */ NextShmemSegID = statbuf.st_ino; for (;;) { IpcMemoryId shmid; PGShmemHeader *oldhdr; IpcMemoryState state; /* Try to create new segment */ memAddress = InternalIpcMemoryCreate(NextShmemSegID, sysvsize); if (memAddress) break;\t/* successful create and attach */ /* Check shared memory and possibly remove and recreate */ /* * shmget() failure is typically EACCES, hence SHMSTATE_FOREIGN. * ENOENT, a narrow possibility, implies SHMSTATE_ENOENT, but one can * safely treat SHMSTATE_ENOENT like SHMSTATE_FOREIGN. */ shmid = shmget(NextShmemSegID, sizeof(PGShmemHeader), 0); PG calls stat() on the data directory, which returns the directory\u0026rsquo;s inode. PG directly uses datadir.inode as the shmkey.\nIn PG, the shmem key is tightly coupled to the data directory\u0026rsquo;s inode. 
Under normal circumstances, shmem key = datadir inode.\nVerification example:\n\u0026gt; ls -id $PGDATA 4096 /lzlcloud/pg8574/data \u0026gt; cat postmaster.pid |head -7|tail -1 4096 143917090 We can see datadir.inode = shmkey = 4096.\nPG shmkey in Cloud Environments # Above I said generally shmkey = datadir.inode, but in cloud environments this is typically not the case.\nOur cloud environment:\n\u0026gt; ls -id /lzlcloud/pg8298/data 4096 /lzlcloud/pg8298/data \u0026gt; ls -id /lzlcloud/pg8388/data 4096 /lzlcloud/pg8388/data \u0026gt; ls -id /lzlcloud/pg8095/data 4096 /lzlcloud/pg8095/data \u0026gt; cat /lzlcloud/pg8298/data/postmaster.pid|head -7|tail -1 4096 971833391 \u0026gt; cat /lzlcloud/pg8388/data/postmaster.pid|head -7|tail -1 4097 62128161 \u0026gt; cat /lzlcloud/pg8095/data/postmaster.pid|head -7|tail -1 4098 143163441 The data disk directories all have inode 4096, but the shmkeys are 4096, 4097, 4098.\nWhy?\nThe inode issue relates to the filesystem:\nEach filesystem has independent inodes The filesystem reserves some inodes — the first few are unusable. Depending on mount options, our data disk\u0026rsquo;s real inodes start at 4096 So datadir.inode = 4096 is the default behavior of our cloud environment\u0026rsquo;s disk mounts. Other environments may differ — I haven\u0026rsquo;t analyzed those deeply. But with the same filesystem and mount approach for PG data directories, inode collisions are still possible.\nThe shmkey issue relates to PG\u0026rsquo;s source code, PGSharedMemoryCreate():\nfor (;;) { ... NextShmemSegID = statbuf.st_ino; ... shmid = shmget(NextShmemSegID, sizeof(PGShmemHeader), 0); ... 
switch (state) { case SHMSTATE_FOREIGN: NextShmemSegID++; break; The initial shmkey = datadir.inode, but since the requested shmem might be FOREIGN (used by another process), PG increments shmkey by 1 and tries again.\nFor example, the instance with shmkey=4097 in postmaster.pid: at startup it tried shmkey=4096, but found that shmid\u0026rsquo;s memory segment was already in use by another instance (the one with shmkey=4096). So it used shmkey+1 to request a different shmid segment.\nSimilarly, the instance with shmkey=4098 had to increment twice to find a free shmkey-shmid pair.\nshmid Relationships # The SysV shmid can be found in the startup error log, line 7 of postmaster.pid, and virtual memory smaps. It can be inspected via the ipcs command and cleaned up with ipcrm.\nExample — note shmid=143917078 throughout:\nStartup error log:\npg_ctl: another server might be running; trying to start server anyway waiting for server to start....2026-03-05 16:02:19 CST::@:[262388]: FATAL: pre-existing shared memory block (key 4096, ID 143917078) is still in use postmaster.pid line 7:\n\u0026gt; cat postmaster.pid |head -7|tail -1 4096 143917078 Virtual memory smaps:\ncat /proc/`head -1 $PGDATA/postmaster.pid`/smaps | grep -E \u0026#34;\\-s\u0026#34; 2ad2b5189000-2ad2b518a000 rw-s 00000000 00:04 143917078 /SYSV00001000 (deleted) Inspecting and cleaning via SysV shmid:\nipcs -m -i 143917078 # cleanup: ipcrm -m shmid Shared memory Segment shmid=143917078 uid=6001 gid=6001 cuid=6001 cgid=6001 mode=0600 access_perms=0600 bytes=56 lpid=242712 cpid=242712 nattch=10 att_time=Thu Mar 5 16:14:51 2026 det_time=Thu Mar 5 16:14:49 2026 change_time=Thu Mar 5 16:14:34 2026 Testing # Reproducing the Production Issue # Hold a backend process alive indefinitely, then kill -9 the postmaster:\n\u0026gt; cat postmaster.pid 4096 143917076 \u0026gt; ipcs -m -i 143917076 # shmem id Shared memory Segment shmid=143917076 uid=6001 gid=6001 cuid=6001 cgid=6001 mode=0600 access_perms=0600 bytes=56 
lpid=241567 cpid=64757 nattch=23 \u0026gt; kill -stop 107648 # any backend \u0026gt; kill -9 64757 # postmaster or another process \u0026gt; ipcs -m -i 143917076 Shared memory Segment shmid=143917076 uid=6001 gid=6001 cuid=6001 cgid=6001 mode=0600 access_perms=0600 bytes=56 lpid=252283 cpid=64757 nattch=1 # nattch != 0 \u0026gt; pg_ctl start -D $PGDATA pg_ctl: another server might be running; trying to start server anyway waiting for server to start....2026-03-05 16:02:19 CST::@:[262388]: FATAL: pre-existing shared memory block (key 4096, ID 143917076) is still in use 2026-03-05 16:02:19 CST::@:[262388]: HINT: Terminate any old server processes associated with data directory \u0026#34;/data\u0026#34;. stopped waiting pg_ctl: could not start server nattch=1 — the instance cannot start.\nNormal Crash Recovery (Successful Startup) # Essentially, kill the instance and then start it:\n\u0026gt; cat postmaster.pid 4096 143917077 \u0026gt; ipcs -m -i 143917077 # shmem id Shared memory Segment shmid=143917077 uid=6001 gid=6001 cuid=6001 cgid=6001 mode=0600 access_perms=0600 bytes=56 lpid=154800 cpid=134329 nattch=18 \u0026gt; kill -9 134329 # postmaster or another process \u0026gt; cat postmaster.pid 4096 143917077 \u0026gt; ipcs -m -i 143917077 # shmem id unchanged, segment still exists Shared memory Segment shmid=143917077 uid=6001 gid=6001 cuid=6001 cgid=6001 mode=0600 access_perms=0600 bytes=56 lpid=169360 cpid=134329 nattch=0 # nattch=0 \u0026gt; ipcs -m -i 143917077 # shmem id unchanged, segment still exists \u0026gt; pg_ctl start -D $PGDATA # startup succeeds pg_ctl: another server might be running; trying to start server anyway waiting for server to start....2026-03-05 16:14:34 CST::@:[242712]: LOG: redirecting log output to logging collector process 2026-03-05 16:14:34 CST::@:[242712]: HINT: Future log output will appear in directory \u0026#34;/data/pg_log\u0026#34;. 
done server started \u0026gt; ipcs -m -i 143917077 # residual shmem cleaned up during startup ipcs: id 143917077 not found \u0026gt; ipcs -m -i 143917078 # shmid incremented by 1 at startup Shared memory Segment shmid=143917078 uid=6001 gid=6001 cuid=6001 cgid=6001 mode=0600 access_perms=0600 bytes=56 lpid=273571 cpid=242712 nattch=26 \u0026gt; cat postmaster.pid # shmkey unchanged, shmid +1 4096 143917078 A normal kill -9 followed by startup works fine — the residual shmem is cleaned up during startup. shmkey stays the same because inode=4096 and shmkey=4096 wasn\u0026rsquo;t occupied. shmid+1 is Linux kernel behavior, at least indicating a different shared memory segment was used.\nHolding a File Descriptor But Not shmem # Since startup is tied to the data directory inode, and inode is tied to shmem id, startup essentially checks whether the shmem is held by another process, not whether a file descriptor is still open. So let\u0026rsquo;s test with the logger process, which holds file descriptors but not shared memory:\n$ cat /proc/77300/smaps | grep -E \u0026#34;\\-s\u0026#34; # logger process — verify it has no shared memory $ kill -stop 77300 # stop logger $ kill -9 77076 # kill -9 pm $ cat postmaster.pid # file still exists 77076 /lzlcloud/pg8531/data 1772700343 8531 /tmp 0.0.0.0 4096 143917080 ready $ ipcs -m -i 143917080 # shared memory still exists Shared memory Segment shmid=143917080 uid=6001 gid=6001 cuid=6001 cgid=6001 mode=0600 access_perms=0600 bytes=56 lpid=77319 cpid=77076 nattch=0 att_time=Thu Mar 5 17:27:11 2026 det_time=Thu Mar 5 17:27:15 2026 change_time=Thu Mar 5 16:45:43 2026 $ ps -ef|grep 77300 # process still alive postgres 77300 1 0 16:45 ? 
00:00:00 postgresql: lzldb: logger postgres 135246 46622 0 17:27 pts/1 00:00:00 grep --color=auto 77300 $ pg_ctl start -D $PGDATA # startup succeeds pg_ctl: another server might be running; trying to start server anyway waiting for server to start....2026-03-05 17:27:55 CST::@:[140497]: LOG: redirecting log output to logging collector process 2026-03-05 17:27:55 CST::@:[140497]: HINT: Future log output will appear in directory \u0026#34;/data/pg_log\u0026#34;. done server started The logger holds files in the data directory but is not associated with shared memory — it does not block startup.\nDeleting postmaster.pid Then Failing to Start # Same procedure: hold a backend process, kill -9 the PM, delete postmaster.pid, attempt startup.\nI\u0026rsquo;ll skip the full output — result: startup fails with:\nwaiting for server to start....2026-03-06 15:29:48 CST::@:[22475]: FATAL: pre-existing shared memory block (key 4098, ID 171868173) is still in use 2026-03-06 15:29:48 CST::@:[22475]: HINT: Terminate any old server processes associated with data directory \u0026#34;/data\u0026#34;. 2026-03-06 15:29:48 CST::@:[22475]: LOG: database system is shut down This shows: even with a zombie process holding shmem, deleting the postmaster.pid (which contains the shmid) doesn\u0026rsquo;t stop PG from finding the corresponding shmid.\nStop a Different Instance, Start the Current One # PG analyzes shmid from two sources to determine if it belongs to the current instance:\nThe shmid corresponding to datadir.inode as shmkey, or after shmkey++ The shmid stored in postmaster.pid Even if postmaster.pid is deleted, PG can still tell whether shmem is held by another process. 
But we can exploit datadir.inode and shmkey++ behavior to get it started.\nSince in our cloud environment all data directory inodes are 4096, and shmkeys differ due to the shmkey++ source logic, we can: start or stop a PG instance whose datadir.inode = 4096 to shift the current instance\u0026rsquo;s shmkey++ by one, obtaining a different shmid.\n$ kill -stop 165245 $ kill -9 164411 # stop current instance, keep one of its backend processes alive $ pg_ctl stop -D /pg8531/data # stop a different instance waiting for server to shut down.... done server stopped $ pg_ctl start -D /pg8574/data # try starting the current instance — fails because postmaster.pid still exists pg_ctl: another server might be running; trying to start server anyway waiting for server to start....2026-03-05 18:22:35 CST::@:[196209]: FATAL: pre-existing shared memory block (key 4097, ID 143917087) is still in use 2026-03-05 18:22:35 CST::@:[196209]: HINT: Terminate any old server processes associated with data directory \u0026#34;/pg8574/data\u0026#34;. stopped waiting pg_ctl: could not start server Examine the log output. $ mv /lzlcloud/pg8574/data/postmaster.pid{,.bak} # delete current instance\u0026#39;s postmaster.pid $ pg_ctl start -D /lzlcloud/pg8574/data # try again — succeeds 2026-03-05 18:23:09 CST::@:[207725]: LOG: redirecting log output to logging collector process 2026-03-05 18:23:09 CST::@:[207725]: HINT: Future log output will appear in directory \u0026#34;/lzlcloud/pg8574/data/pg_log\u0026#34;. done server started $ ipcs -m -i 143917087 # the shmid\u0026#39;s SysV segment is still held by our zombie process Shared memory Segment shmid=143917087 uid=6001 gid=6001 cuid=6001 cgid=6001 mode=0600 access_perms=0600 bytes=56 lpid=196209 cpid=164411 nattch=1 att_time=Thu Mar 5 18:22:35 2026 det_time=Thu Mar 5 18:22:35 2026 change_time=Thu Mar 5 18:21:04 2026 Startup succeeds — the current instance requested a different shared memory segment. The old segment wasn\u0026rsquo;t cleaned up. 
This is the \u0026ldquo;hack\u0026rdquo; of stopping another instance to start the current one in a cloud environment.\nA small prerequisite: the other instance must have not only inode = current instance inode, but also shmkey \u0026lt; current instance shmkey.\nError Analysis: lock file \u0026quot;postmaster.pid\u0026quot; already exists # This problem is much simpler than the shared memory one.\nDuring startup, PG checks the lock file and its contained PID, in CreateLockFile():\nif (other_pid != my_pid \u0026amp;\u0026amp; other_pid != my_p_pid \u0026amp;\u0026amp; other_pid != my_gp_pid) { if (kill(other_pid, 0) == 0 || (errno != ESRCH \u0026amp;\u0026amp; errno != EPERM)) { /* lockfile belongs to a live process */ ereport(FATAL, (errcode(ERRCODE_LOCK_FILE_EXISTS), errmsg(\u0026#34;lock file \\\u0026#34;%s\\\u0026#34; already exists\u0026#34;, filename), isDDLock ? (encoded_pid \u0026lt; 0 ? errhint(\u0026#34;Is another postgres (PID %d) running in data directory \\\u0026#34;%s\\\u0026#34;?\u0026#34;, (int) other_pid, refName) : errhint(\u0026#34;Is another postmaster (PID %d) running in data directory \\\u0026#34;%s\\\u0026#34;?\u0026#34;, (int) other_pid, refName)) : (encoded_pid \u0026lt; 0 ? errhint(\u0026#34;Is another postgres (PID %d) using socket file \\\u0026#34;%s\\\u0026#34;?\u0026#34;, (int) other_pid, refName) : errhint(\u0026#34;Is another postmaster (PID %d) using socket file \\\u0026#34;%s\\\u0026#34;?\u0026#34;, (int) other_pid, refName)))); } } Testing is even simpler — just start it a second time while already running:\n$ pg_ctl start -D /pg8531/data pg_ctl: another server might be running; trying to start server anyway waiting for server to start....2026-03-06 15:59:05 CST::@:[89145]: FATAL: lock file \u0026#34;postmaster.pid\u0026#34; already exists 2026-03-06 15:59:05 CST::@:[89145]: HINT: Is another postmaster (PID 255500) running in data directory \u0026#34;/pg8531/data\u0026#34;? 
stopped waiting pg_ctl: could not start server Examine the log output. So the later errors in the fault\u0026rsquo;s start.log were because the instance was already running and someone tried starting it multiple more times.\nSummary # When starting, PG first allocates a SysV shmem segment (not the mmap-based shared_buffers) to lock the data directory. The lock is obtained by using the data directory\u0026rsquo;s inode as the shmkey passed to shmget(), which returns a unique shmid. Since the requested shmem may already be in use by another process, PG increments shmkey++ in an infinite loop until it finds an unclaimed segment. postmaster.pid line 7 stores both the shmkey and shmid. In cloud environments, you\u0026rsquo;ll often see adjacent PG instances with incrementing shmkeys — this happens because the data disks are mounted identically and share the same starting inode, causing shmkey++ to kick in.\nIf a PG instance is killed unexpectedly, the shmem is not automatically cleaned up. Under normal conditions, no zombie process holds the shared memory, so startup cleans it up and proceeds normally. Under abnormal conditions, a zombie process still holds the shared memory — startup fails and manual intervention is required.\nRecommended handling:\nipcrm -m (most recommended) Use lsof to find the zombie process and kill it Reboot the host Not recommended but possible workarounds:\nmv postmaster.pid + stop a different PG instance (where the other instance\u0026rsquo;s shmkey \u0026lt; current instance\u0026rsquo;s shmkey) mv postmaster.pid + remount the data disk to change its inode Finally, answering the opening questions:\nWhy isn\u0026rsquo;t this scenario more common in practice? Abnormal instance crash + zombie processes still alive. Many crash scenarios leave no zombie processes, so startup just works.\nThe start.log shows two different error types — what do they correspond to? 
The \u0026ldquo;shared memory in use\u0026rdquo; error means the instance crashed abnormally and zombie processes still exist. The \u0026ldquo;postmaster.pid already exists\u0026rdquo; error means a start was attempted again while the instance was already running.\nCan shared memory still exist if the postmaster is gone? Yes. SysV shared memory outlives the postmaster: after kill -9 the segment stays in the kernel, and surviving backends can remain attached to it. Once every attached process has exited, the segment is still there with nattch=0, and the next startup removes it automatically.\nHow do you locate and clean up this shared memory segment? The shmid appears in the startup error log (start.log) and on line 7 of postmaster.pid. Clean it up with ipcrm -m $shmid.\nPG has multiple shared memory segments; which one is this? The SysV shmem used to protect the data directory. It always exists. See the \u0026ldquo;Three Types of Shared Memory\u0026rdquo; section. With the default shared_memory_type=mmap, it is distinct from the mmap-based shared_buffers.\nCan you find the corresponding shmem via inode or file? Linux does not provide a userspace interface to find SysV shmem by inode or file (this statement is 100% AI-generated, cross-validated across multiple models). PG uses the data directory\u0026rsquo;s inode as a seed shmkey to request shared memory; it does not directly find shmem by inode. PG has its own mechanism for locating SysV shmem, but it\u0026rsquo;s not an absolute mapping, which is why the shmkey++ retry exists as a compromise in the startup logic.\n","date":"Mar 9, 2026","externalUrl":null,"permalink":"/en/2026/03/09/case-study-startup-failure-and-sysv-shared-memory/","section":"Posts","summary":"Problem Symptoms # The database instance’s RSS memory was maxed out, OOM messages appeared in the logs, and the instance died. 
We won’t analyze the OOM cause here.\nBut startup kept failing — 4 or 5 attempts according to the logs:\n2026-02-12 09:15:21 CST::@:[578272]: FATAL: pre-existing shared memory block (key 2048, ID 1328250881) is still in use 2026-02-12 09:15:21 CST::@:[578272]: HINT: Terminate any old server processes associated with data directory \"/data\". 2026-02-12 09:15:21 CST::@:[578272]: LOG: database system is shut down 2026-02-12 09:21:03 CST::@:[658824]: FATAL: pre-existing shared memory block (key 2048, ID 1328250881) is still in use 2026-02-12 09:21:03 CST::@:[658824]: HINT: Terminate any old server processes associated with data directory \"/data\". 2026-02-12 09:21:03 CST::@:[658824]: LOG: database system is shut down 2026-02-12 09:31:12 CST::@:[794791]: LOG: redirecting log output to logging collector process 2026-02-12 09:31:12 CST::@:[794791]: HINT: Future log output will appear in directory \"/data/pg_log\". 2026-02-12 09:31:37 CST::@:[801049]: FATAL: lock file \"postmaster.pid\" already exists 2026-02-12 09:31:37 CST::@:[801049]: HINT: Is another postmaster (PID 794791) running in data directory \"/data\"? 2026-02-12 09:32:34 CST::@:[814396]: FATAL: lock file \"postmaster.pid\" already exists 2026-02-12 09:32:34 CST::@:[814396]: HINT: Is another postmaster (PID 794791) running in data directory \"/data\"? 
Startup succeeded after the DBA ran ipcrm -m xxx before starting.\n","title":"Case Study: Startup Failure and SysV Shared Memory","type":"posts"},{"content":"","date":"Mar 9, 2026","externalUrl":null,"permalink":"/en/categories/postgresql%E6%A1%88%E4%BE%8B/","section":"Categories","summary":"","title":"PostgreSQL案例","type":"categories"},{"content":"AI rate: This article has approximately 60% AI involvement, with about 20 rounds of battling with AI\nRecommendation reason: Contains some reflections and insights on AI Ops, hence recommended\nWriting in the AI Era # For authors who write blogs or WeChat public accounts, AI may be a fatal blow, because AI writing is simply too easy. As someone who writes articles myself, I have many internal struggles about how AI affects writing habits, and it pains me too. Let me revisit some earlier thoughts on writing:\nWhy write?\nFor myself: To consolidate knowledge. Output is what strengthens input. Glancing at something once versus writing it out are completely different experiences — writing can take several times longer than just reading. For example, when you see a profound and seemingly familiar sentence, rewriting it yourself reveals countless details within it. For myself: To leverage others\u0026rsquo; biases constructively. Mainly to use readers\u0026rsquo; expectations as motivation to persist in writing and to enhance the credibility of content. Knowledge you consume yourself may be \u0026ldquo;good enough,\u0026rdquo; but writing for a public audience forces you to weigh every word and take responsibility for others. (Relatively speaking — not actual word-by-word scrutiny.) For myself: To build reputation. This depends heavily on writing quality. For others/the community: To spread knowledge. Good things should be shared and used by everyone — this is at the core of the PostgreSQL open-source community. Encouraging sharing, not hoarding, is a principle I\u0026rsquo;ve always upheld. 
Building connections: This wasn\u0026rsquo;t my goal, but I have indeed met some friends through it. Human writing was already difficult; in the AI era, human writing is essentially Hell Mode — like walking against the current without a destination, unable to see any light, while everyone else is heading the opposite direction. I\u0026rsquo;ve certainly experienced AI-powered interpretation, translation, and article generation, but it never feels like mine, or it loses the original purpose of training myself. Or, at a deeper level, I want to feel the vitality of the work.\nThe DBA community\u0026rsquo;s articles can be described as a mixed bag — people write about everything. I\u0026rsquo;ve always preferred substantive, content-rich articles focused on PostgreSQL internals and operations, like those by Cancan and Xiangbo — I eagerly anticipate every piece and read them carefully. Generally, content-oriented articles don\u0026rsquo;t get much traffic (both Cancan and Xiangbo have complained about this on their public accounts\u0026hellip;), and I\u0026rsquo;m quite easygoing about it myself.\nHowever, my previous article \u0026ldquo;PG Operations Database Operations Experience 2025\u0026rdquo; gained a surprising number of followers, which truly astonished me. So I\u0026rsquo;ve been pondering this question for days: Why would a non-AI-written, non-comprehensive, DBA-focused, knowledge-oriented article attract so much interest? What does AI mean for DBAs?\nReflections on Operations # The Essence of Operations and AI Ops # Operations involve many things. To narrow the scope of discussion, I\u0026rsquo;ll focus on just one small part of operations work — incident response — to interpret the essence of DB Ops. 
First, my position: \u0026ldquo;Operations is not merely a technical problem.\u0026rdquo;\nMany people argue that since both humans and AI make mistakes, AI can be given authority to act boldly — specifically, if AI\u0026rsquo;s error rate ≤ human error rate, replacement is justified. I thought the same two years ago, but I no longer do. Because the real-world environment is far more complex, with at least the following factors to consider:\nThe consensus problem. There is consensus that a DBA might accidentally delete data, but another consensus is easily overlooked: in normal circumstances, the team assumes the DBA won\u0026rsquo;t delete data. How to understand this? For example, when hiring a DBA, a responsible team will assess whether the person is mentally stable, then default to assuming they won\u0026rsquo;t delete data, and maintain this assumption throughout long-term work. At the very least, I don\u0026rsquo;t constantly worry that my colleague will drop the database. But when \u0026ldquo;hiring\u0026rdquo; an AI DBA, it has no mental state, and no one assumes it won\u0026rsquo;t delete data. \u0026ldquo;It will delete data\u0026rdquo; is everyone\u0026rsquo;s consensus, creating deployment resistance. The importance of data. C-end (consumer) data and B-end (business) data have different importance levels. Retail, internet, government, and financial industry data also differ in criticality. The more an industry values data, the more sensitive it is to data reliability and business continuity. A personal computer has no business continuity and only one person cares about data reliability, but in the financial industry, business continuity can directly trigger widespread social concern — financial data reliability simply cannot be questionable. AI Ops deployment must consider system criticality; it cannot be rolled out across all domains simultaneously. The management system. 
For example, in financial systems, DBAs hold high privileges and are governed by a set of management procedures. So shouldn\u0026rsquo;t an AI DBA also have corresponding management procedures before it can be deployed? What about abnormal login detection, or abnormal backend access? How does it request permissions, and for how long? What level of permission in what scenario? These are all unresolved issues. AI\u0026rsquo;s own security. For instance, the paper STRATUS mentions prompt injection attacks, for which there is currently no effective solution. If someone injects a \u0026ldquo;drop database\u0026rdquo; prompt, it might just execute it. But humans basically don\u0026rsquo;t have this problem — if you tell a DBA \u0026ldquo;drop database,\u0026rdquo; they\u0026rsquo;ll just ask you what you\u0026rsquo;re trying to do. The responsibility problem. Operations engineering is not a \u0026ldquo;knowledge problem\u0026rdquo; but a \u0026ldquo;responsibility problem.\u0026rdquo; One of the core tasks of operations is to make irreversible decisions about the system within limited time during an incident, and take responsibility for those actions. AI can replace \u0026ldquo;formalizable operations\u0026rdquo; but cannot replace \u0026ldquo;judgments that must bear consequences\u0026rdquo; — at least not yet. Full of noise. Operations is an \u0026ldquo;open system,\u0026rdquo; not a closed reasoning system. Databases run in extremely complex environments, while AI\u0026rsquo;s reasoning premise is that the world can be described in text. But the real operations world is filled with noise, contingency, and undocumented behaviors. Situational pressure. Real business environments include recovery time pressure, organizational and customer emotional management, etc. 
The book Google SRE describes a common recovery scenario: customers asking when it will be restored, leadership asking why failover hasn\u0026rsquo;t happened, engineers gathering various information under pressure while calling people to confirm recovery procedures. AI cannot feel this pressure. The first two questions are fundamentally not technical problems, but they must be answered. In real scenarios, the answers at that moment are likely to be rough at best. Let\u0026rsquo;s imagine what conditions would be needed for fully automated AI operations to truly happen:\nAI won\u0026rsquo;t destroy critical data — at least, the vast majority of people need to reach this consensus about AI. Complete management procedures are needed, including how to grant AI permissions, just like how we grant DBA permissions. Solve the problem of AI itself being attacked. Not just LLMs, but the entire IT system encompassing AI. A no-blame operations culture (or eliminating operations altogether is another approach). Accept erroneous judgments. Form consensus around the existence of noise and environment, and tolerate AI Ops iteration cycles. If recovery takes too long or the blast radius expands, don\u0026rsquo;t allow human intervention — because if human intervention is required, that person is still the operator (semi-automated AI Ops?). Pressure-free recovery context. This means leaders, customers, and public opinion don\u0026rsquo;t need responses, or they trust some AI\u0026rsquo;s response. This is a human transformation, not an IT system transformation. AIOps and Agent Research Results # The Tsinghua AIDB repository\u0026rsquo;s directory contains many AI4DB papers — too many for a person to read. I used NotebookLM to summarize the paper categories:\nAgain, to narrow the scope (mainly to reduce my own effort), let\u0026rsquo;s focus on database diagnostics content.\nAIOps has made decent academic progress. 
AIOps research integrates machine learning, reinforcement learning, and large language models into database management, covering key tasks such as parameter tuning, index recommendations, query optimization, and fault diagnosis. The goal is to build \u0026ldquo;self-driving\u0026rdquo; database systems with self-awareness and self-healing capabilities. While significantly improving complex workload performance and operational efficiency, this also drives the DBA\u0026rsquo;s transformation from low-efficiency manual intervention to high-level architectural supervision.\nRegarding whether \u0026ldquo;DBAs will be eliminated,\u0026rdquo; current research trends and industry practices (especially self-driving databases and LLM applications) show that the DBA role is undergoing a profound transformation from \u0026ldquo;manual operator\u0026rdquo; to \u0026ldquo;senior manager/supervisor,\u0026rdquo; rather than simple replacement. The DBA\u0026rsquo;s core value will shift toward managing AI operations strategies, ensuring data security and compliance, and handling extreme anomaly scenarios that AI cannot resolve.\nAnother AI Ops Frontier Survey article describes Agents this way:\n\u0026ldquo;This shows that AI Agents are not a silver bullet. To apply Agents, we need not only progress at the model and agent level, but also sufficient support capabilities from the entire operational system — such as Kubernetes-like declarative interfaces, good observability, and reversible operation design. Stratus\u0026rsquo;s preliminary experiments demonstrate the potential of Agents in automated operations, but there remain enormous gaps in performance, reliability, and security before production deployment.\u0026rdquo;\nThe development domain, fueled by the booming vibe coding movement, is clearly advancing much faster than AI in operations. I\u0026rsquo;d also love to have a confirm/redo operations remote control — the problem is, it doesn\u0026rsquo;t exist yet. 
Even if we fantasize about \u0026ldquo;vibe maintaining\u0026rdquo; one day, I doubt many ops people would turn on yolo mode.\nThe Value of a DBA # Is a DBA\u0026rsquo;s Value Just Being the Decision-Maker and Scapegoat? # Endorsement indeed seems to be something AI cannot solve. So is the DBA\u0026rsquo;s value just being the decision-maker and scapegoat? After all, a DBA\u0026rsquo;s knowledge is far less than AI\u0026rsquo;s — it\u0026rsquo;s just that AI can\u0026rsquo;t make the final call.\n1. Instantaneous Context\n\u0026ldquo;The DBA\u0026rsquo;s knowledge is far less than AI\u0026rsquo;s\u0026rdquo; — this is true for general knowledge (like how to optimize a SQL query, or the meaning of a configuration parameter). But AI lacks instantaneous runtime context. AI knows database principles, but it doesn\u0026rsquo;t know the accumulated historical debt hiding behind the load balancer during the sudden traffic spike of your company\u0026rsquo;s Double Eleven (Singles\u0026rsquo; Day). The DBA possesses unstructured experience about \u0026ldquo;this specific machine, this specific business, these specific people.\u0026rdquo; In the face of extreme failures, AI offers the \u0026ldquo;highest-probability suggestion,\u0026rdquo; while the DBA offers \u0026ldquo;the operation that best preserves the system\u0026rsquo;s life under this specific pressure.\u0026rdquo;\n2. The Last Gate of a Chaotic System\nThe database is the most fragile and least fault-tolerant part of all IT architectures (code can be rolled back, but data loss can bankrupt a company). AI\u0026rsquo;s logic is extrapolation based on historical data. When encountering unprecedented underlying hardware bad sectors, extremely rare distributed deadlocks, or novel hacker attack methods, AI\u0026rsquo;s \u0026ldquo;suggestions\u0026rdquo; often fail or even cause secondary damage. 
The core of \u0026ldquo;making the call\u0026rdquo; is not \u0026ldquo;which solution to choose,\u0026rdquo; but \u0026ldquo;hedging against risk.\u0026rdquo; This kind of control over extreme situations is something current AI cannot provide.\n3. Chain of Trust\nThe DBA is the maintainer of the chain of trust: for example, if you let AI audit AI, then who audits the AI\u0026rsquo;s audit logic? At the levels of data security, compliance, and ethics, there must be a human with the highest privileges who can be held accountable as the endpoint of the trust chain.\nLet\u0026rsquo;s flip the perspective: if DBAs really were just \u0026ldquo;less knowledgeable decision-makers and scapegoats,\u0026rdquo; then enterprises would have long ago transferred DBA decision-making authority to SREs, architecture committees, or even AI and other responsible entities. But the reality is, at truly critical moments, enterprises still call \u0026ldquo;that person.\u0026rdquo; This shows the question was never \u0026ldquo;who is smarter,\u0026rdquo; but who can bear the consequences for the organization amid uncertainty. The DBA is the last human in this chaotic database system who holds the authority to stop losses, the responsibility, and the terminal point of trust.\nSo is every decision made by the DBA? Obviously not. The DBA does not hold \u0026ldquo;objective decision-making authority\u0026rdquo; but rather \u0026ldquo;risk veto power\u0026rdquo; — they cannot decide whether the business should take risks, but they can determine which risks the system cannot bear. 
In simple, low-risk, rollback-able scenarios, decisions are often made automatically by processes or systems; only when decisions enter high-risk, irreversible territory where responsibility must converge is the DBA pushed to the forefront.\nThe Uniqueness of the Postgres DBA # For the specific group of Postgres (PG) DBAs, this uniqueness is even more pronounced.\nIn modern technical organizations, DBAs do not naturally hold architectural decision-making authority, nor do they monopolize index or parameter formulation. Architects can design solutions, developers can write SQL, and AI can even provide seemingly comprehensive best-practice recommendations. But these decisions mostly occur at the abstraction layer, design layer, and probability layer — they assume the system is rollback-able, replay-able, and correctable.\nPostgres\u0026rsquo;s uniqueness lies in the fact that it hands a great deal of freedom to its users, and these freedoms ultimately translate into long-term side effects in real systems: write amplification, I/O pattern changes, Vacuum imbalance, WAL bloat, and unpredictable performance degradation. These side effects cannot be fully rehearsed at the design stage, cannot be subcontracted to a single role, and cannot simply be \u0026ldquo;withdrawn\u0026rdquo; after an incident occurs. When the system enters an unstoppable, unreplayable state, the only person still responsible for the overall outcome is often the DBA.\nTherefore, the value of a Postgres DBA lies not in \u0026ldquo;making decisions for others\u0026rdquo; (though you certainly can), but in continuously managing the real-world consequences of all decisions after they have already been made. 
\u0026ldquo;Architects define the ideal, developers implement functionality, AI predicts the future; and the DBA guards reality.\u0026rdquo;\nThis ability to guard reality is based on the PG DBA having sufficient understanding of Postgres, sufficient understanding of the system\u0026rsquo;s real environment, sufficient understanding of the system\u0026rsquo;s history, and sufficient immediate context. In the AI era, one more thing needs to be added: sufficient understanding of AI.\nWhy Keep Learning # In the past two years, I\u0026rsquo;ve heard \u0026ldquo;learning is useless\u0026rdquo; rhetoric more than ever before. I generally scoff at such talk. Let me take this opportunity to properly address it.\nDoes foundational database knowledge still have value? The answer is: its value is higher than ever. Let\u0026rsquo;s interpret this from three angles: the right to explain, active learning, and why I keep revisiting the classics.\n1. The Right to Explain\nFoundational knowledge enables three things:\nIdentifying \u0026ldquo;systemic inevitable failure points\u0026rdquo; in advance Clearly articulating the judgment logic Transforming \u0026ldquo;I\u0026rsquo;m going with my gut\u0026rdquo; into \u0026ldquo;this is the system-determined outcome\u0026rdquo; The true meaning of learning database fundamentals is not to \u0026ldquo;do more work,\u0026rdquo; but to:\nDelineate responsibility boundaries Enhance discourse power Let the system endorse your judgments 2. Active Learning Becomes an Even Rarer Ability\nIn the AI era, the \u0026ldquo;technical barrier\u0026rdquo; to knowledge acquisition approaches zero. Active learning ability is scarce. Why is \u0026ldquo;active learning\u0026rdquo; even rarer in the AI era? This is counter-intuitive but very real. AI makes \u0026ldquo;passive learning\u0026rdquo; extremely comfortable — ask and answer anytime, no long-term investment required, no need to endure cognitive discomfort. 
But the result is that more and more people stay in the \u0026ldquo;instant gratification layer,\u0026rdquo; unwilling to learn foundational knowledge anymore. While everyone else regresses, you find yourself pulling ahead.\n3. Why Do I Keep Re-reading Classics like The Internals of PostgreSQL?\nTechnical people need to read books because books don\u0026rsquo;t just give answers; they help build a cognitive model that can be run repeatedly and continuously refined. AI is currently better at answering questions than at shaping such models. AI struggles to become this kind of \u0026ldquo;long-term dialogue partner\u0026rdquo; because its answers are unstable.\nFrom another perspective, consider the value of books in terms of cognitive economics, information theory, and token cost. First, you don\u0026rsquo;t need to battle with AI back and forth. The real cost of battling is not money, but your attention and context-maintenance ability. Second, the hundreds of thousands of words in a book require neither massive prompt input nor excessive token expenditure from you. Third, the knowledge in books has been repeatedly verified by authors and readers; it is already compressed knowledge, the easiest to learn. So: classic books = high-density, focused, repeatedly human-verified knowledge in compressed form, obtained at extremely low token cost.\n4. Learning AI Itself\nThis needs no elaboration from me.\nMy battles with AI led me to an interesting conclusion about \u0026ldquo;why read books\u0026rdquo;:\nIn the AI era, knowledge is cheap, but judgment is expensive =\u0026gt; And judgment comes from a stable, calibratable cognitive model =\u0026gt; A stable cognitive model is itself a byproduct of \u0026ldquo;long-term high-quality knowledge intake.\u0026rdquo;\nAnother real-world piece of evidence supporting the \u0026ldquo;reading is useful\u0026rdquo; argument: this very article depends on books, papers, other articles, and information I\u0026rsquo;ve read. 
Without that foundation, this article would not exist.\nWhy People Still Love Reading \u0026ldquo;Human-Written\u0026rdquo; Articles # The Psychology of Preferring Imperfection # Human-written technical articles are inevitably riddled with flaws. Looking back at my own \u0026ldquo;Operations Experience 2024\u0026rdquo; from last year, I can find many holes. Even \u0026ldquo;Operations Experience 2025,\u0026rdquo; completed just days ago, I consider incomplete — vastly different from something AI would write. So why do readers still enjoy such flawed technical articles?\nThe reason may be that humans are not attracted by \u0026ldquo;information correctness,\u0026rdquo; but by \u0026ldquo;empathetically imperfect traces of a human mind.\u0026rdquo; From the book A Brief History of Intelligence, we know that the human intelligence model inherently includes self-trial-and-error exploration and observing others\u0026rsquo; behaviors to map onto oneself — this is a learning process, innate to humans. Our brains automatically scan text for hesitation, uncertainty, logical gaps, awkward expressions, emotional leakage, etc. — all things systematically absent from AI text. In the imperfections and emotions of writing, readers can feel the author\u0026rsquo;s thinking and emotions, whereas AI merely presents results. Readers almost never have emotional resonance with AI. Generally speaking, only those who have truly experienced something leave these \u0026ldquo;unattractive\u0026rdquo; traces.\nSo I believe many people, like me, can identify purely AI-written technical articles at a glance (not guaranteed 100% accurate) and generally won\u0026rsquo;t have the emotional drive to read through them. 
But if it\u0026rsquo;s something a human has seriously written, they\u0026rsquo;ll read carefully, sensing what the author felt and catching shortcomings or contextual gaps.\nOf course, authors could prompt AI to mimic their previous writing style or deliberately leave flaws. But I haven\u0026rsquo;t seriously tried this: I briefly generated a few pieces and felt the emotional immersion was still quite poor. I don\u0026rsquo;t plan to explore this further; there\u0026rsquo;s not much point.\nBorrowing an Expert\u0026rsquo;s ATTENTION for Free # The core difference between AI articles and expert articles is not \u0026ldquo;how well they\u0026rsquo;re written,\u0026rdquo; but the economics of asking questions and access to an expert\u0026rsquo;s Attention. Expert writing is about allocating attention on behalf of the reader; AI writing is about avoiding missing any potentially relevant information. This is not a capability issue; it\u0026rsquo;s a difference in objective functions. Truly high-value technical articles don\u0026rsquo;t tell you all the correct answers. They block 80% of what you shouldn\u0026rsquo;t be paying attention to right now.\nWhy do experts dare to \u0026ldquo;delete,\u0026rdquo; while AI doesn\u0026rsquo;t? Because experts bear cognitive responsibility for what you end up understanding. AI does not bear the consequences of you learning or applying things wrong. So experts deliberately filter out details that don\u0026rsquo;t need attention right now. This filtering is itself the value of expertise. For humans, the bottleneck in learning is not insufficient information, but limited attention and not knowing where to look first. An expert\u0026rsquo;s article directly hands you the result and says: just focus on this. But when facing an LLM, do you know what to look for?\nThis is not a denial of the value of AI articles. 
AI excels at \u0026ldquo;rapidly expanding the information space when you already know the problem boundaries,\u0026rdquo; while expert articles excel at \u0026ldquo;contracting the problem space for you before you\u0026rsquo;ve established a judgment framework.\u0026rdquo; The former is good for filling gaps and lateral expansion; the latter is good for building core understanding and key intuition. The truly efficient learning approach is not choosing one over the other, but first using experts to achieve Attention alignment, then using AI to do amplified search within the bounded space.\nThis isn\u0026rsquo;t saying AI content is useless or human-written content is useless; each has its own use.\nCan AGI Solve All Problems? # Refuting Musk # Recently Musk has been painting grand visions again. After reading them, I don\u0026rsquo;t agree.\n1. Shared Prosperity or the Useless Class?\nThe \u0026ldquo;useless class\u0026rdquo; is a concept from Yuval Noah Harari\u0026rsquo;s Homo Deus. He argues that when AI\u0026rsquo;s productivity surpasses that of ordinary people, using AI to do work will replace having ordinary people do it. These people become the useless class. Resources will increasingly concentrate in the hands of a few elites and large corporations, and most people will lose their jobs, yet there is currently no effective policy to provide a safety net. This view contradicts the Musk-style shared prosperity vision. Musk believes that when AGI is realized, no one will need to worry about survival, education, or healthcare, because productivity will be so high that governments will provide a safety net for most people. I currently support Harari\u0026rsquo;s view. In fact, judging from anecdotal observation of the people around us, the useless class does appear to be growing.\n2. 
Can High Productivity Create a Utopia?\nOne theory supporting my disagreement with the AGI utopia comes from another book, Evolutionary Psychology — Mate Selection Criteria. One particularly striking insight: Due to social division of labor and the biological drive to raise well-adapted offspring, men tend to prefer young, healthy women, while women tend to prefer healthy, resourceful men. This default filtering engraved in our genes means that humans cannot live equally — you don\u0026rsquo;t want to be the one eliminated. So if a non-comparing, non-competitive, resource-equal utopia could be sustained, productivity is merely one necessary condition among many — there are many other social problems that must be solved, which the public tends to overlook. This isn\u0026rsquo;t narrowly referring only to evolutionary psychology; some things haven\u0026rsquo;t been carefully discussed, such as the power struggles in Chimpanzee Politics, which should also be considered.\n3. Calhoun\u0026rsquo;s Mouse Utopia Experiment\nIn 1972, animal behaviorist John B. Calhoun designed and described in detail a famous experimental environment — \u0026ldquo;Universe 25.\u0026rdquo; This was a laboratory \u0026ldquo;utopia\u0026rdquo; specially crafted for mice, striving for perfection in almost every aspect: abundant food, water, and nesting materials; regularly cleaned living environment; no predator threats; temperature maintained between 20°C and 31°C via fans and heating, stable and comfortable.\nThe mouse population\u0026rsquo;s march toward extinction seems somewhat insane. I\u0026rsquo;ll focus only on the process: 1) Increased violence 2) No longer pursuing the opposite sex 3) Increased homosexual behavior 4) Increased solitary behavior 5) Males grooming themselves excessively 6) Apathy, etc. Of course, this experiment has flaws. 
From the intelligence model described in A Brief History of Intelligence, the intelligence gap between mice and humans still spans the entire primate stage, so the experiment cannot represent human society. But at minimum, it shows that utopia triggers new social problems; people won\u0026rsquo;t just quietly live their lives.\n4. An Economics-Based Society\nFrom the perspective of modern economics, the question of whether AGI can achieve a \u0026ldquo;shared-prosperity utopia\u0026rdquo; splits into two cases: retaining the modern economy or not retaining it.\nIf we retain the modern economy, AGI can be viewed as an extremely efficient \u0026ldquo;universal factor of production.\u0026rdquo; It significantly reduces the costs of knowledge production, decision support, organizational coordination, and marginal labor, raising the ceiling of society-wide productivity. Under this premise, wealth distribution, public service provision, and social security mechanisms still rely on markets, price signals, incentive structures, and institutional constraints. AGI\u0026rsquo;s role is more about expanding the size of the \u0026ldquo;distributable pie\u0026rdquo; than about automatically solving distribution problems. In other words, shared prosperity remains a political economy problem, not a technical one. AGI can only lower the cost of achieving goals; it cannot replace institutional design.\nSo, if we don\u0026rsquo;t retain the modern economy and instead try to bypass markets, prices, and incentive systems to directly rely on AGI to achieve some kind of \u0026ldquo;techno-utopia,\u0026rdquo; is it feasible?\nThe answer is almost certainly no.\nA utopia without a modern economy was repeatedly shown to fail in the 1960s–70s. 
The fundamental reason was not that \u0026ldquo;technology wasn\u0026rsquo;t advanced enough\u0026rdquo; at the time, but that the problems of information and incentives were structurally unsolvable: Even with powerful centralized computing capability, you cannot replace the preference information transmitted by dispersed individuals through price mechanisms, nor can you sustain innovation drive, responsibility constraints, and resource allocation efficiency over the long term. AGI can improve the computational capacity of centralized decision-making, but it cannot eliminate the fundamental economic question of \u0026ldquo;who is responsible for decisions, who bears consequences, who holds the right to choose.\u0026rdquo;\nTherefore, AGI is not a replacement for the modern economy, but an amplifier within the modern economy\u0026rsquo;s framework. Any \u0026ldquo;techno-utopia\u0026rdquo; that detaches from market mechanisms, incentive structures, and institutional constraints, whether AGI is introduced or not, will essentially replay the historical path of failure — just in more subtle forms and at higher cost.\nProductivity (including the intellectual enhancement brought by AGI) is only one of the conditions required for utopia, and far from the most critical one. 
Utopia is not a computing power problem, nor an intelligence problem — it is a problem of the stability of human behavior under institutional constraints.\nThe Mathematical Foundation for Why AI Cannot Solve Everything # The following is excerpted from Wu Jun\u0026rsquo;s The Beauty of Mathematics:\n\u0026ldquo;In 1900, Hilbert posed many problems, one of which was: \u0026lsquo;Can any (polynomial) Diophantine equation be determined, through a finite number of operations, to have integer solutions or not?\u0026rsquo; If the universal answer to Hilbert\u0026rsquo;s question is negative, then it means that for many mathematical problems, even God doesn\u0026rsquo;t know whether an answer exists — because the Diophantine equation solving problem is only a very small part of all mathematical problems. For problems whose very answer-existence cannot be determined, the answer naturally cannot be found. It was precisely Hilbert\u0026rsquo;s contemplation of the boundaries of mathematical problems that made Turing understand the limits of computation\u0026hellip; Matiyasevich rigorously proved that, except for a very small number of special cases, in general, it is impossible to determine through finite operations whether a Diophantine equation has integer solutions. 
The resolution of this problem had a far greater impact on human cognition than its mathematical influence\u0026hellip; If even the solution\u0026rsquo;s existence is unknown, it\u0026rsquo;s even more impossible to solve them through computation.\u0026rdquo;\n\u0026ldquo;A rational-state Turing machine can only solve a subset of problems that have answers\u0026hellip; Many engineering problems are not artificial intelligence problems\u0026hellip; Today, what we should worry about is not how powerful artificial intelligence or computers are, much less should we think they are omnipotent, because their boundaries have already been clearly delineated by the boundaries of mathematics\u0026hellip; There are still many problems in the world that need to be solved by humans. How to make good use of AI tools to more effectively solve human problems is what deserves more attention.\u0026rdquo;\n(See? Reading is useful, right? It explains it clearly to you directly — you probably couldn\u0026rsquo;t ask the right question or get such an accessible answer. See? Following hardcore tech content creators is useful — I filtered it for you. Hit that follow button \u0026#x2b50;)\nConclusion # As a technical blogger, I rarely write about such social issues. I originally just wanted to briefly write about why my previous article got traffic, but explaining this phenomenon somewhat expanded the scope of the problem \u0026#x1f613;.\nLimitations of this article:\nOnly discussed a very small part of DBA work — incident recovery — without discussing the intelligentization of other tasks. GPT knows me too well and seems to be flattering me. It indeed makes very valid points, but I cannot endorse what it says. This is somewhat circular: AI helps me confirm that AI cannot endorse things — an output that inherently cannot be endorsed. From my own perspective, its reasoning is indeed good, with quotable lines throughout. Some Ops scenarios are certainly easy to AI-ify. 
But through the discussion in this article, AI-ifying the incident recovery domain still faces considerable difficulty. I have never given up on using AI, nor have I ever given up on using the human brain. I simply enjoy identifying in which scenarios AI works well, in which it works poorly, and in which it cannot be used at all. This may give the article a tone that seems pessimistic about AI\u0026rsquo;s future, but my thinking is not pessimistic.\nAt the beginning of this article, you can see the AI rate is 50%. In reality, I also discussed similar issues with several friends and included my own thinking, so the true intellectual composition of this article is:\nAI rate 50%, other human brain rate 10%, my brain rate 40%\nSo this article is also a typical case of \u0026ldquo;not giving up on using AI, nor giving up on using the human brain.\u0026rdquo;\nLet me conclude with a few questions to briefly state my views:\nWhy do people still love reading human-written articles? Psychological preference and attention alignment.\nIs reading useful (not just books)? Useful, and more useful than ever (bad books are more useless than ever; knowledge taste is more important than ever).\nWill AIOps be realized? Yes, but it will take time, and it won\u0026rsquo;t be easy. This requires academic breakthroughs and the thinking and practice of operations (including DBAs).\nWill DBAs be replaced? No. Like software developers, they will experience changes in work patterns but will not disappear.\nWhich DBAs will remain? \u0026ldquo;Those who understand both DB and AI, who don\u0026rsquo;t depend on AI, yet don\u0026rsquo;t give up on judgment.\u0026rdquo;\nWill AGI be realized? Yes.\nWill AGI achieve universal prosperity? No.\nIf you\u0026rsquo;d like to discuss AI Ops or the issues in this article with me, you can find me in various PG groups — I should be easy to find. 
You can also leave me a message.\nref # https://github.com/TsinghuaDatabaseGroup/AIDB\nhttps://mp.weixin.qq.com/s/urqh4NZDmkXvDllBCCdZDA\nZhao, Y., et al. (2025). \u0026ldquo;STRATUS: A Multi-agent System for Autonomous Reliability Engineering of Modern Clouds\u0026rdquo;. Advances in Neural Information Processing Systems (NeurIPS)\nhttps://zhuanlan.zhihu.com/p/631632685\nThe Beauty of Mathematics (《数学之美》)\n","date":"Jan 21, 2026","externalUrl":null,"permalink":"/en/2026/01/21/dba-writing-learning-and-the-future-in-the-ai-era/","section":"Posts","summary":"AI rate: This article has approximately 60% AI involvement, with about 20 rounds of battling with AI\nRecommendation reason: Contains some reflections and insights on AI Ops, hence recommended\nWriting in the AI Era # For authors who write blogs or WeChat public accounts, AI may be a fatal blow, because AI writing is simply too easy. As someone who writes articles myself, I have many internal struggles about how AI affects writing habits, and it pains me too. Let me revisit some earlier thoughts on writing:\n","title":"DBA, Writing, Learning and the Future in the AI Era","type":"posts"},{"content":"This is a technical operations summary, focused on being accessible and practical. It also serves as a periodic reflection on PostgreSQL database operations. Hope it helps fellow PGers.\nPrevious ops experience: PostgreSQL Operations Experience 2024. Note: this article does not repeat content from that one.\nCPU # SQL performance problems are the most common root cause in PostgreSQL incident handling. This includes poor SQL performance, suboptimal indexing, sudden high concurrency, and execution plan regressions. 
For a database like PostgreSQL that lacks a robust plan-binding mechanism, having a DBA team to help design data models, access patterns, indexes, and tune execution plans is crucial — it can significantly reduce sudden CPU saturation incidents.\nExecution Plans # Execution plan instability is an age-old problem with cost-based optimizers, and PostgreSQL is no exception.\nInaccurate DISTINCT Estimates # Case Study: From Inaccurate DISTINCT to DISTINCT Calculation Principles\nWith the default statistics target of 100, ANALYZE samples 30,000 rows (300 × target). For tables far larger than the sample, the estimated distinct count is likely to be low. Note: this assumes the data doesn\u0026rsquo;t have too many unique values.\nTesting on a table with different sample sizes:\nTable: reltuples=800 million, relpages=20 million, size=175GB, actual distinct on the target column: 100 million.\nstatistics target | pages sampling rate | tuples sampling rate | n_distinct | execution time\n50 | 0.00075 | 0.00001875 | 60K | 2s\n100 | 0.0015 | 0.0000375 | 110K | 5s\n1000 | 0.015 | 0.000375 | 1.03M | 58s\n3000 | 0.045 | 0.001125 | 2.68M | 3m01s\n10000 | 0.15 | 0.00375 | 6.75M | 7m21s\n(statistics target max value: 10000)\nRough summary: n_distinct and analyze execution time grow roughly in proportion to sample size, yet even at the maximum target the estimate (6.75M) is still far below the actual 100 million.\nn_distinct increases with sample size, while pages and tuples estimates remain consistently accurate.\nGeneric Plan Interference # PostgreSQL execution plans must account for generic plans. A generic plan is parameter-independent and is costed with default selectivity assumptions. The first five executions of a prepared statement use custom plans; after that, the generic plan\u0026rsquo;s cost is compared with the average cost of those custom plans, and the generic plan is used from then on if it is not noticeably more expensive.\nCase Study: Adding an Index Causes Performance Degradation and Generic Plans\nI. Classification of generic plan estimation problems\nBecause of the 5-execution comparison mechanism, generic plan problems fall into two categories:\nThe first 5 SQL executions are not representative. Heavily dependent on data skew and whether the first 5 parameter values are representative. The generic plan itself is flawed. 
Due to data skew or inability to accurately compute selectivity even with balanced data, the generic plan is inherently inefficient.\nII. Solution reference\nGeneric plan problems often surface on partitioned tables. When the partition key is continuous, scanning all partitions should yield a selectivity of 1, but the generic plan estimates 0.05 — likely resulting in a \u0026ldquo;full index scan\u0026rdquo; scenario.\nConsider these when optimizing:\nDon\u0026rsquo;t create too many indexes that confuse the optimizer.\nEliminate generic plan interference: execute the prepared statement 6 times for real.\nCompare plans with session-level set plan_cache_mode='force_generic_plan'; or set plan_cache_mode='force_custom_plan'; or, on PG 16+, use explain (GENERIC_PLAN) to compare.\nSyntax reference:\n--prepare/execute\nPREPARE sql1(text) AS SELECT COUNT(*) FROM LZL where a=$1;\nEXECUTE sql1(\u0026#39;zzz\u0026#39;); --run 6 times first\nEXPLAIN EXECUTE sql1(\u0026#39;zzz\u0026#39;);\nselect * from pg_prepared_statements; --view prepared statement info, current session only\n--Compare execution plans, set session parameter then EXPLAIN EXECUTE\nset plan_cache_mode=\u0026#39;force_generic_plan\u0026#39;;\nset plan_cache_mode=\u0026#39;force_custom_plan\u0026#39;;\n--Directly view generic plan, 16+\nexplain (GENERIC_PLAN) xx\nLWLock:Lockmanager Caused by Row Locks # LWLock Lockmanager issues typically occur on partitioned tables under high concurrency with queries lacking partition keys. This year, a new scenario was discovered: Row Locks Causing LWLock:Lockmanager\nThis isn\u0026rsquo;t a major issue — blocking on concurrent updates to the same row is well known. I just hadn\u0026rsquo;t expected that updating the same row could also produce LWLock:Lockmanager. Not a particularly valuable case study, but when you see LWLock:Lockmanager as a wait event, consider row locks.\nIdle Connections # PostgreSQL performance generally improves with each major release. 
PG 14 made significant optimizations to snapshot acquisition and backend transaction tracking, yielding noticeable improvements for high idle connection counts:\n(https://techcommunity.microsoft.com/blog/adforpostgresql/improving-postgres-connection-scalability-snapshots/1806462)\nHowever, this doesn\u0026rsquo;t mean you can ignore idle connections after PG 14. They still consume backend transaction maintenance overhead, cause context switches, fragment memory, etc. — the more idle connections, the worse the performance.\nTypically, application connections have keepalive and pooling. Maintaining some idle connections avoids creating new connections for every request, which would be far more expensive. Small databases generally don\u0026rsquo;t need to worry much about connection counts (as long as they\u0026rsquo;re not absurd) — CPUs are cheap, the system isn\u0026rsquo;t critical, and scaling is easy. But large databases are different. CPU count is the hard limit; you can\u0026rsquo;t just add more. Large databases already have many idle connections; adding more doesn\u0026rsquo;t necessarily increase throughput — when CPU is already tight, it can backfire.\nPG 15 benchmark experience: with 5K idle as baseline, increasing to 10K idle adds ~2-5 vCPU overhead for idle maintenance; 20K idle adds ~5-10 vCPU. Approximate.\nIdle in Transaction # Last year I thoroughly criticized long transactions, because they impact PostgreSQL more severely than other databases (Oracle, MySQL, etc.). But this is manageable — with proper alerting and operations, long transactions are solvable.\nWhen monitoring session states, you need to check them. active means running SQL, idle in transaction means in a transaction but not currently executing SQL. All pg_stat_activity states, PG 15:\nCurrent overall state of this backend. Possible values are:\nactive: The backend is executing a query. idle: The backend is waiting for a new client command. 
idle in transaction: The backend is in a transaction, but is not currently executing a query. idle in transaction (aborted): This state is similar to idle in transaction, except one of the statements in the transaction caused an error. fastpath function call: The backend is executing a fast-path function. disabled: This state is reported if track_activities is disabled in this backend. Common states are: active, idle, idle in transaction, idle in transaction (aborted). A common misconception about idle in transaction: it only means no SQL is running right now and the transaction hasn\u0026rsquo;t committed — it does NOT mean the transaction has been idle for a long time. Don\u0026rsquo;t use xact_start + idle in transaction to judge how long a transaction has been idle. Use state_change + idle in transaction instead.\nMemory # Memory issues are extremely tricky, and I handled many this year, finding some good solutions. But memory knowledge is broad — I\u0026rsquo;ll try to simplify as much as possible, going straight to symptoms, results, and solutions.\nMemory Issues and Huge Pages # Classification of PostgreSQL memory problems:\nRelevant wchan states for PG memory issues:\nHuge pages are very effective against memory fragmentation and direct memory reclaim within cgroups.\nBenchmark results for huge pages: https://docs.paic.com.cn/#/post/84479375\nTheoretical benefits of huge pages:\nReduced TLB pressure Reduced page table size in main memory Huge pages are physically contiguous. Contiguous physical memory access is better than non-contiguous With huge pages, pages are directly mapped without multi-level PTE entries However, huge pages bring management challenges:\nMust pre-allocate huge pages Must calculate huge page size in advance to avoid memory waste Memory knowledge is extensive. For more, refer to Advanced Linux Memory. 
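As a rough sketch of the pre-allocation step above, the number of huge pages an instance needs can be read from the server itself instead of being computed by hand. This assumes PG 15 or later (where shared_memory_size_in_huge_pages exists) and PG 17 or later for huge_pages_status. The OS-side vm.nr_hugepages value still has to be set separately, and how much headroom to add on top is a judgment call that these commands do not make for you.

```sql
-- Whether huge pages are requested at all: off, on, or try
SHOW huge_pages;

-- Huge page size PostgreSQL will request (0 means the OS default size)
SHOW huge_page_size;

-- PG 15+: huge pages needed for the main shared memory area
-- (reports -1 when huge pages are not supported on the platform)
SHOW shared_memory_size_in_huge_pages;

-- PG 17+: whether the running instance actually obtained huge pages
SHOW huge_pages_status;
```

Comparing the third value with HugePages_Total and HugePages_Free in /proc/meminfo is one way to spot both under-allocation and wasted reservation.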
Key takeaways:\nRule out OS-level issues before tackling PG instance-level issues Huge pages have remarkable effects, but in rare cases they don\u0026rsquo;t help Many people don\u0026rsquo;t monitor pgpgin/pgpgout/pgfree, or even pgscank/pgscand — they only look at CPU and memory usage. That\u0026rsquo;s insufficient for operating PostgreSQL. Without good operational practices, PG memory can be very unstable Notable Cgroup Knowledge # Cgroup knowledge is also extensive. Refer to earlier articles; here\u0026rsquo;s a quick summary.\nCgroup v1 has inherent flaws:\nDoes not account for cgroup page tables Does not account for cgroup slab Does not account for cgroup huge pages (huge pages are not charged, not just uncounted) Does not account for cgroup async/sync page reclaim Cgroup RSS and process RSS have inconsistent accounting methods shmem accounting is messy Unsolved Mysteries # Huge pages have solved many problems, but not all. The unsolved portion remains to be researched — hopefully clarified in 2026.\nPay Attention to the OS # Pay Attention to Everything OS # To operate open-source databases well, you need to understand the operating system.\n(Source forgotten)\nTo operate PostgreSQL well, understanding OS principles is essential. PostgreSQL is built on top of the OS (especially Linux) — it uses whatever Linux provides. PostgreSQL is part of the Linux ecosystem. To truly understand how it works, understand the OS first.\nRule out OS-level issues before tackling PG instance-level issues.\n(My own words)\nI. CPU\nSince PostgreSQL doesn\u0026rsquo;t use NUMA, whether on bare metal or cgroup/pod-managed CPU, you rarely need to dive into OS-level CPU internals. CPU issues can mostly be diagnosed from SQL or PG stack traces.\nII. Memory\nSee the Memory section. Memory issues require OS-level investigation.\nIII. Processes\nInspecting PG process states from the OS is critical. You need to check D state, wchan, RSS, syscalls, at minimum.\nIV. 
Host Status and Logs\nMonitor host status — CPU, memory, IO, network, logs at the host level. Very important.\nIt\u0026rsquo;s hard to imagine that a vague network IO alert like \u0026ldquo;an I/O error occurred while sending to the backend\u0026rdquo; is related to underlying storage. Beyond /var/log/messages, PG itself shows nothing. (Of course, this error may have other causes — don\u0026rsquo;t misinterpret.)\nV. Others\nUncategorized.\nPhysical Reads # PostgreSQL itself does not directly expose a \u0026ldquo;true physical disk read\u0026rdquo; metric. The various reads in pg_stat_* (e.g., pg_stat_database.blks_read) are reads from the OS cache.\nSo how do you monitor physical reads?\nReads or buffer allocation metrics are supplementary. The best approach is monitoring the OS itself.\nThe OS is PostgreSQL\u0026rsquo;s ecosystem. Never look at the database in isolation. Not being able to monitor physical reads at the database level is nothing to be ashamed of — as long as you have a solution.\nMonitor iostat and other disk metrics. For cloud environments, OS-level observability is already mature — don\u0026rsquo;t waste cloud-native observability.\nAutovacuum # SQL for monitoring autovacuum processes: sql autovacuum_queue_and_progress\nAutovacuum Freeze on Large Databases # With properly configured parameters, monitoring, and alerting, autovacuum freeze requires little attention in most databases.\nHowever, in databases with extremely high transaction throughput and very large data volumes, you still can\u0026rsquo;t ignore it. Autovacuum prevent wraparound may be running constantly. At minimum, watch these two points:\nAge alerting: handle promptly and try to prevent the next alert. Don\u0026rsquo;t wait until the last moment to panic (acceleration options depend on version, e.g., INDEX_CLEANUP OFF, BUFFER_USAGE_LIMIT adjustments) Impact on memory (especially cache). 
If autovacuum runs nonstop on a very large database, it impacts cache and memory For principles and parameters, see this howtos diagram:\nLarge Tables That Won\u0026rsquo;t Finish Vacuuming # \u0026ldquo;Large tables\u0026rdquo; means hundreds of GB, typically with many indexes and dead tuples that prevent vacuum from completing.\nThe main bottleneck: (auto)vacuum cleans dead index tuples one by one per dead row. Large table (auto)vacuum is slow here — you\u0026rsquo;ll typically see many dead tuples on the table. Worse, (auto)vacuum may run slower than the rate of dead tuple generation — vacuum never finishes, infinite bloat.\nExperience with large tables that can\u0026rsquo;t finish:\nFor the same table, dead tuple count is roughly proportional to execution time From autovacuum log\u0026rsquo;s user time and elapsed time, you can observe CPU time and execution time, and roughly estimate delay sleep time Disabling autovacuum cost-based delay can reduce execution time by ~3× (index-size dependent; based on a 200GB table with 280GB indexes) Adjusting a table\u0026rsquo;s autovacuum cost-based delay means letting autovacuum rest less when processing that table — consuming more CPU and scan IO in a shorter time How to accelerate?\nRepack. Repack is a nuclear option — fast table rebuild for emergencies. But repack is a CLI tool; running it manually each time is cumbersome. Tune autovacuum cost-based delay parameters. Either 1. Increase cost limit: alter table t1 SET (autovacuum_vacuum_cost_limit=1000);, or 2. Disable delay entirely: alter table t1 SET (autovacuum_vacuum_cost_delay=0);. Recommended only for tables that can\u0026rsquo;t keep up. Drop unnecessary indexes. Scanning indexes and updating index entries takes the most time — dropping unnecessary indexes is effective. Partitioned tables. Recommended partition size ≤10GB. Converting to partitioned tables is the best solution. Drop updated_time column indexes to leverage HOT, reducing bloat rate. 
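The per-table tuning above can be sketched roughly as follows. The table name t1 is the same placeholder used above, and the ordering and LIMIT are arbitrary choices for illustration. The views and storage parameters are standard PostgreSQL, but the right values depend on how much extra CPU and IO the instance can absorb.

```sql
-- Tables with the most dead tuples, and when autovacuum last finished on them
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

-- Progress of vacuums running right now in the current database
-- (an index_vacuum_count above 1 means the indexes are scanned in several passes)
SELECT p.pid, c.relname, p.phase,
       p.heap_blks_scanned, p.heap_blks_total, p.index_vacuum_count
FROM pg_stat_progress_vacuum p
JOIN pg_class c ON c.oid = p.relid;

-- Let autovacuum work harder on one table that cannot keep up
ALTER TABLE t1 SET (autovacuum_vacuum_cost_limit = 1000);
-- or remove the cost-based sleep for this table entirely
ALTER TABLE t1 SET (autovacuum_vacuum_cost_delay = 0);

-- Return to the instance-wide defaults once the backlog is cleared
ALTER TABLE t1 RESET (autovacuum_vacuum_cost_limit, autovacuum_vacuum_cost_delay);
```

Keeping the override on a single table, and resetting it afterwards, limits the extra load to the table that actually needs it.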
Checkpoint and Bgwriter # The checkpointer not only creates checkpoints (affecting recovery time) but also flushes dirty buffers. The bgwriter only flushes dirty buffers. Starting from PG 17, some metrics moved to pg_stat_checkpointer. For PG ≤16, mainly look at pg_stat_bgwriter.\nI. Checkpoint intervals\nMetric checkpoints_timed: corresponds to checkpoint_timeout parameter\nMetric checkpoints_req: corresponds to max_wal_size parameter\nRecommend using checkpoint_timeout as the primary checkpoint interval. If checkpoints_req appears, increase max_wal_size and tune flush parameters accordingly. When FPIs are present, also check these two metrics.\nII. Flush metrics\nMetric buffers_checkpoint: dirty buffers flushed by checkpointer\nMetric buffers_clean: dirty buffers flushed by bgwriter\nMetric buffers_backend: dirty buffers flushed by backends — should be as close to zero as possible; occurrence means bgwriter isn\u0026rsquo;t aggressive enough\nMetric buffers_backend_fsync: number of times a backend had to issue its own fsync call instead of handing it off to the checkpointer; this should stay at zero\nThe tuning goal is flush priority: bgwriter flush \u0026gt; checkpointer flush \u0026gt; backend flush\nThe checkpointer can flush as a side effect, but checkpointer flush speed is hard to control — it can cause IO spikes. So bgwriter flush priority should be higher than checkpointer. Backend flush is obviously worst — minimize it.\nIII. 
Bgwriter flush parameters\nBgwriter controls flush speed through a \u0026ldquo;write some, pause, write again\u0026rdquo; cycle:\nParameter bgwriter_delay: how long to pause between cycles\nParameter bgwriter_lru_maxpages: max pages to write per cycle\nParameter bgwriter_lru_multiplier: pages per cycle = (recent buffer allocation × lru_multiplier), capped at lru_maxpages\nParameter bgwriter_flush_after: after this much data has been written, ask the OS to start writing it back to storage (a writeback hint, not a full fsync)\nMetric buffers_alloc: represents shared memory buffer allocation (allocation means actual eviction occurred, somewhat indicative of pgpgin)\nMetric maxwritten_clean: number of times bgwriter_lru_maxpages was reached\nDefault bgwriter flush logic, per cycle: flush (recent buffer allocations × 2, at most 100 dirty buffers), sleep 200ms, and hint the OS to write back after every 64 buffers (512kB) flushed.\nPer-cycle flush volume depends on recent buffer allocation and bgwriter_lru_multiplier. During peak times, buffer allocation is typically high, so it usually hits bgwriter_lru_maxpages. Thus: bgwriter_lru_maxpages caps peak flush volume; bgwriter_lru_multiplier prevents excessive flushing during off-peak times.\nIV. Flush parameter reference\nDefault max bgwriter flush = 100 pages × 5 cycles per second (1s / 200ms) × 8KB = 3.9MB/s. The defaults are definitely too low. If tuning upward, adjust based on shared_buffers size and workload.\nAfter all that theory, here\u0026rsquo;s a practical reference:\n#Read/write ratio 2:8, high load\nshared_buffers=40GB\ncheckpoint_timeout=20min\nmax_wal_size=80GB\nbgwriter_delay=20ms\nbgwriter_lru_maxpages=1000\nbgwriter_lru_multiplier=4\nAdjust further as needed.\nAs for effects: from practical experience, don\u0026rsquo;t expect standalone bgwriter tuning to yield great results. Overly aggressive bgwriter tuning can even backfire.\nSo: If your database hasn\u0026rsquo;t been clearly diagnosed with checkpoint flush spikes or other flush issues, don\u0026rsquo;t touch this. 
Only recommended for core large databases with high concurrency, as a supplementary tuning strategy alongside other changes (migrations, shared_buffer adjustments, etc.).\nV. Flush parameter summary\nBgwriter flushing can be summarized as \u0026ldquo;three hard\u0026rsquo;s\u0026rdquo;:\n\u0026ldquo;Hard to understand, hard to tune, hard to see results.\u0026rdquo;\nDB4AI # AI Task Scheduling Writes to Database # AI applications are widely deployed at the development level. One scenario: AI task invocations write to the database. Task invocations can spike instantly, and the database writes may lack concurrency control, causing CPU or other resource spikes.\nThis is a new database incident pattern in the AI era. Be careful.\nVector HNSW # Reference: https://postgresql.us/events/pgconfnyc2024/sessions/session/1862/slides/172/pgvector_best_practices_pgconfnyc2024.pdf\nHNSW Index Build Acceleration # HNSW index builds can be extremely slow — millions of rows can take hours.\nFactors affecting HNSW build speed include instance memory (and CPU) as well as index build parameters:\nmaintenance_work_mem=3g max_parallel_maintenance_workers=2 m=12 ef_construction=100 Building HNSW indexes can be painful. Ways to accelerate:\nBuilding the index before data load is an option. Though the total initial time is slower, developers may accept \u0026ldquo;a bit slower\u0026rdquo; but cannot accept \u0026ldquo;index building for 1 hour.\u0026rdquo; Optimizing post-load index builds: SET maintenance_work_mem = '8GB' SET max_parallel_maintenance_workers = 8 Post-load index builds need attention to memory — strongly related to instance memory and free memory.\nNote: maintenance_work_mem can protect OS memory. 
If maintenance_work_mem exceeds available OS memory and the table is large, the connection is terminated immediately (fast failure):\nERROR: 53200: could not resize shared memory segment \u0026#34;/PostgreSQL.1390017142\u0026#34; to 6439348672 bytes: Cannot allocate memory LOCATION: dsm_impl_posix, dsm_impl.c:314 Note: if memory used during build exceeds maintenance_work_mem, an info notice appears (after some time):\nNOTICE: 00000: hnsw graph no longer fits into maintenance_work_mem after 886990 tuples DETAIL: Building will take significantly more time. HINT: Increase maintenance_work_mem to speed up builds. LOCATION: InsertTuple, hnswbuild.c:525 HNSW Index Query Performance # Query recall and performance need to be balanced via the ef_search parameter.\nBesides ef_search, one more factor significantly impacts query speed: whether the HNSW index is cached in memory.\nIndex NOT in memory:\nexplain (analyze,buffers) SELECT image_id, applyNo, feature_vector \u0026lt;-\u0026gt; (select vectorsit FROM image_features_test2 ORDER BY distance LIMIT 10; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=11852.80..11865.74 rows=10 width=35) (actual time=82193.073..82193.185 rows=10 loops=1) Buffers: shared hit=1796 read=9309 I/O Timings: shared/local read=82108.559 InitPlan 1 (returns $0) -\u0026gt; Limit (cost=0.00..0.02 rows=1 width=32) (actual time=0.008..0.009 rows=1 loops=1) Buffers: shared hit=1 -\u0026gt; Seq Scan on test_0 (cost=0.00..23.60 rows=1360 width=32) (actual time=0.007..0.008 rows=1 loops=1) Buffers: shared hit=1 -\u0026gt; Index Scan using idx_feature_hnsw on image_features_test2 (cost=11852.78..1292546.60 rows=989705 width=35) (actual time=82193.071..82193.179 rows=10 loops=1) Order By: (feature_vector \u0026lt;-\u0026gt; $0) Buffers: shared hit=1796 read=9309 I/O Timings: shared/local read=82108.559 Planning: 
Buffers: shared hit=1 Planning Time: 0.130 ms Execution Time: 82193.279 ms Index IN memory:\nQUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=11852.80..11865.74 rows=10 width=35) (actual time=20.240..20.350 rows=10 loops=1) Buffers: shared hit=11105 InitPlan 1 (returns $0) -\u0026gt; Limit (cost=0.00..0.02 rows=1 width=32) (actual time=0.007..0.008 rows=1 loops=1) Buffers: shared hit=1 -\u0026gt; Seq Scan on test_0 (cost=0.00..23.60 rows=1360 width=32) (actual time=0.007..0.007 rows=1 loops=1) Buffers: shared hit=1 -\u0026gt; Index Scan using idx_feature_hnsw on image_features_test2 (cost=11852.78..1292546.60 rows=989705 width=35) (actual time=20.239..20.344 rows=10 loops=1) Order By: (feature_vector \u0026lt;-\u0026gt; $0) Buffers: shared hit=11105 Planning: Buffers: shared hit=1 Planning Time: 0.093 ms Execution Time: 20.392 ms Same index, same execution plan — the performance difference between index-in-memory and index-not-in-memory is 82193.279 / 20.392 = 4000×!\nThis gap cannot be ignored. When monitoring HNSW index performance, always check whether the index is in memory. 
Reference SQL:\n--Check if HNSW index is cached in shared buffers via pg_buffercache SELECT c.relname, pg_size_pretty(count(*) * 8192) as buffered, round(100.0 * count(*) / (SELECT setting FROM pg_settings WHERE name=\u0026#39;shared_buffers\u0026#39;)::integer, 1) AS buffer_percent, round(100.0 * count(*) * 8192 / pg_table_size(c.oid), 1) AS percent_of_relation FROM pg_class c INNER JOIN pg_buffercache b ON b.relfilenode = c.relfilenode INNER JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database()) GROUP BY c.oid, c.relname ORDER BY 3 DESC LIMIT 10; relname | buffered | buffer_percent | percent_of_relation ---------------------------------+------------+----------------+--------------------- idx_feature_hnsw_1 | 2117 MB | 91.9 | 44.5 idx_feature_hnsw | 78 MB | 3.4 | 2.0 pg_inherits_parent_index | 8192 bytes | 0.0 | 100.0 Application Releases # DDL Tips # Online DDL tools like pg-osc and pg_migrate don\u0026rsquo;t support partitioned tables, and they have other issues — real-world use is difficult. So DDL tips are still useful: lowering lock levels, proactively identifying blocking, etc., to reduce DDL blocking and rewrite risks.\nKey points for understanding this diagram:\nBefore changes:\nEnsure no long transactions on the table — long transactions hold locks on tables persistently. Long transactions are a well-known hazard in PG; handle them first.\nEnsure no autovacuum (to prevent wraparound) on the table — autovacuum generally doesn\u0026rsquo;t block SQL, except when running to prevent wraparound:\nAutovacuum workers generally don\u0026rsquo;t block other commands. If a process attempts to acquire a lock that conflicts with the SHARE UPDATE EXCLUSIVE lock held by autovacuum, lock acquisition will interrupt the autovacuum. 
However, if the autovacuum is running to prevent transaction ID wraparound (i.e., the autovacuum query name in the pg_stat_activity view ends with (to prevent wraparound)), the autovacuum is not automatically interrupted.\nlock_timeout=2000 (milliseconds) — if a lock cannot be acquired within 2 seconds, bail out to avoid mass blocking.\nSpecial cases for \u0026ldquo;small-to-large\u0026rdquo; type changes:\nSmall-to-large type changes generally don\u0026rsquo;t rewrite the table, but there are exceptions. Pay special attention to int → bigint (common for PK columns) and char(n) → char(m). Partitioned table indexes. Small-to-large type changes on partitioned tables don\u0026rsquo;t rewrite the table, but they do rebuild indexes — and rebuilding indexes on partitioned tables is typically very slow, potentially causing prolonged blocking under the ACCESS EXCLUSIVE (level-8) lock. This behavior is unique to partitioned tables. Changing column types:\nAlmost always rewrites the table, except for equivalent types or small-to-large cases. DDL lock-level reduction tips:\nUse CIC (CREATE INDEX CONCURRENTLY) for indexes. CIC is not supported on a partitioned parent, so run CIC on each child table instead (remember to attach the index to the parent index). CIC has multiple phases. Phases 2 and 3 acquire a SHARE lock, blocking DML. (Official docs only mention SHARE UPDATE EXCLUSIVE — CIC isn\u0026rsquo;t a simple explicit lock.) Add primary keys with USING INDEX. For partitions, leverage \u0026ldquo;add PK on child table + add PK on parent can merge existing child PKs.\u0026rdquo; Use VALIDATE CONSTRAINT for constraints. PG \u0026lt;18 doesn\u0026rsquo;t support adding a NOT NULL constraint as NOT VALID and validating it later. Use CHECK(col1 IS NOT NULL) instead: once that CHECK constraint is validated, a subsequent SET NOT NULL (PG 12+) skips the full-table scan. Adding a column with a volatile DEFAULT rewrites the table. Use the non-volatile-no-rewrite property: add the column first (no rewrite), then UPDATE legacy data as needed. 
When attaching partitions, use CHECK constraints to reduce downtime, and use VALIDATE CONSTRAINT for the CHECK. CREATE TABLE LIKE + ATTACH has much lower lock levels than PARTITION OF (though I still prefer PARTITION OF). After changes:\nRemember to collect statistics (needed in many scenarios). Parallel Index Creation # In production, you may need to create indexes on very large tables that take a long time. Parallel index creation can shorten build time.\nParallel index creation on regular tables:\nParallel parameter: max_parallel_maintenance_workers\nPrerequisites:\nEnough workers: check max_parallel_workers, max_worker_processes Increase maintenance_work_mem to GB scale Notes:\nEffective for B-tree and BRIN maintenance_work_mem limits the entire utility command. Unlike parallel query, where resource limits are per worker process. From test results, parallel index creation shows diminishing returns beyond 8 workers (this conclusion may not hold in all environments).\nParallel index creation on partitioned tables:\nRecommend manual parallel creation across child partitions — run index creation on multiple partitions simultaneously rather than using native parallelism. 
This reduces multi-process coordination overhead.\nCached Plan Must Not Change Result Type # After adding a new column the previous night, application connections started throwing errors the next morning: \u0026ldquo;cached plan must not change result type\u0026rdquo;\nReproduction:\ncreate table a(b varchar(10));\nPREPARE p1 (varchar) AS SELECT * FROM a WHERE b=$1;\nALTER TABLE a ALTER COLUMN b TYPE varchar(20);\nEXECUTE p1 (\u0026#39;abcd\u0026#39;);\nERROR: 0A000: cached plan must not change result type\nLOCATION: RevalidateCachedQuery, plancache.c:718\nTest environment solutions: DEALLOCATE ALL — actively discard prepared statements Or, DISCARD ALL — actively discard all session state\nDEALLOCATE ALL; --or: DISCARD ALL\nPREPARE p1 (varchar) AS SELECT * FROM a WHERE b=$1;\nEXECUTE p1 (\u0026#39;abcd\u0026#39;);\nProduction environment solutions:\nSince the error occurs at the application layer, JDBC can handle DEALLOCATE ALL / DISCARD ALL, but the application may not have implemented this. Immediate production solutions:\nSolutions (choose one):\nSince connection pools like HikariCP have connection cycling and timeout mechanisms, killing idle sessions will gradually reduce errors. Similarly, due to connection pool cycling, you can do nothing — as the pool gradually establishes new connections, the errors fade. If business pressure is high enough, consider killing all application connections. Rolling restart of the application. Not recommended:\n\u0026ldquo;Restart the application after every DDL.\u0026rdquo; It works, but I don\u0026rsquo;t recommend it as standard practice. autosave=conservative. It works but enables subtransactions. A savepoint is set for each query; rollback happens only for rare cases like \u0026lsquo;cached statement cannot change return type\u0026rsquo; or \u0026lsquo;statement XXX is not valid,\u0026rsquo; where the JDBC driver rolls back and retries. 
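The first immediate solution above (killing idle sessions so that the pool reconnects with fresh prepared statements) could look roughly like this. The role name app_user and the one-minute idle threshold are placeholders for illustration, not values from the incident; adjust both, and check what the first query returns before running the terminate step.

```sql
-- Preview: idle client sessions of the affected application role
SELECT pid, usename, application_name, state, state_change
FROM pg_stat_activity
WHERE backend_type = 'client backend'
  AND usename = 'app_user'              -- hypothetical application role
  AND state = 'idle'
  AND state_change < now() - interval '1 minute'
  AND pid <> pg_backend_pid();

-- Terminate them; the pool then opens new connections without the stale cached plans
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE backend_type = 'client backend'
  AND usename = 'app_user'              -- hypothetical application role
  AND state = 'idle'
  AND state_change < now() - interval '1 minute'
  AND pid <> pg_backend_pid();
```

Filtering on state = 'idle' leaves sessions that are mid-transaction alone, which is why errors fade gradually rather than all at once.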
JDBC configuration suggestions:\nConfigure automatic retry after transaction rollback: https://developer.aliyun.com/article/741750 Other JDBC config references: https://jdbc.postgresql.org/documentation/server-prepare/#corner-cases. Note: some suggestions are not suitable for production. Physical Replication # Query Conflicts # Query conflicts are a notoriously frustrating feature that directly impacts the usability of PG standby queries. Query conflicts increase standby lag, yet long-running queries on the standby are logically reasonable. This forces PG administrators to balance between lag management and long-query management — a problem that doesn\u0026rsquo;t exist in other relational databases.\nHidden characteristics of query conflicts:\nEven static tables can trigger query conflicts (see: From Static Table Query Conflicts to Their Principles). The conflict is a snapshot conflict, largely unrelated to table-level locks — snapshot conflicts are cross-table. Long queries affect short queries. Once a long query pushes standby lag to max_standby_streaming_delay, even short queries get canceled. Continuous short queries also cause query conflicts. For example, one short query hasn\u0026rsquo;t finished when the next starts — the two queries may be logically similar, and the startup process hasn\u0026rsquo;t had time to apply WAL. Both short queries hold the XID that needs to be applied. Check whether pg_stat_activity.backend_xmin is less than the XID the startup process is applying. Recommended standby query practices:\nUsing RTO SLO to tune max_standby_streaming_delay is a good approach. When arguments lead nowhere, SLO-based IT management saves the day. Separate short/fast business queries from long queries (data extraction, reporting) onto different standbys to reduce mutual interference. Standby queries still need SQL optimization. Standby WAL apply lag must be monitored. Logical Replication # Logical replication has countless pitfalls. 
2024 had many nasty cases; 2025 had some too, but less severe, mostly on older PG versions. Overall, logical replication on newer PG versions is trending toward stability.\nSlow DDL/DCL Parsing on Older PG Versions # Case Study: GRANT and Walsender Stuck\nOn PG 13 and earlier, certain DDL/DCL statements parse slowly and may affect walsender lag. These include:\nBatch GRANT (including grant all tables) + pathman extension installed (whether used or not) Batch DDL/TRUNCATE/DCL/DROP PUBLICATION Older PG + Multiple Replication Links + Flink # Flink requires one link per table. Since PostgreSQL walsenders re-decode independently, dozens of Flink links on one PG database are common — and hard to refactor.\nOn PG 11 and earlier, the walsender main loop calls PostmasterIsAlive(), causing poor loop performance. Starting from PG 12, WalSndLoop no longer polls PostmasterIsAlive() in the main loop; instead, status checks are placed inside WalSndWait, using event-based passive notification. This greatly reduces CPU contention.\nIf you have multiple Flink links on an older PG version, upgrading can alleviate certain walsender resource contention issues, including:\nMay resolve the problem where walsender startup resource contention prevents the database from coming up for a long time May resolve upstream heavy data changes (including DDL rewrites) causing runtime walsender log decoding CPU saturation Older PG Cannot Auto-Sync New Partitions # On older PG versions with declarative partitioning, note that you can only publish child tables individually. PG ≥13 supports publishing by parent table. Below that, you must configure sync per partition child table name:\nAllow partitioned tables to be logically replicated via publications (Amit Langote) § §\nPreviously, partitions had to be replicated individually. Now a partitioned table can be published explicitly, causing all its partitions to be published automatically. 
Addition/removal of a partition causes it to be likewise added to or removed from the publication. The CREATE PUBLICATION option publish_via_partition_root controls whether changes to partitions are published as their own changes or their parent\u0026rsquo;s.\nIn other words, if this partitioned table is an upstream for sync, every time a new partition is added, you must adapt the sync tool to publish it — otherwise, new partition data won\u0026rsquo;t sync.\nMigration and Upgrades # Xinchuang Migration and glibc Upgrades # Whether it\u0026rsquo;s Xinchuang (domestic tech migration) or Linux OS version upgrades, glibc upgrades may be involved — and glibc upgrades can be extremely painful. PG sorting was entirely OS-dependent before PG 17.\nPostgreSQL cannot detect compatibility issues from glibc upgrades. Every minor release of the GNU C Library makes locale changes. The most problematic version in practice is glibc 2.28, because 2.28 brought a major collation data update: the locale data was updated to a new upstream version from ISO that is in sync with Unicode 9.0.0.\nCollations come in many types, and many environments use linguistic sorting (e.g., en_US.utf8), which is the most version-sensitive. Collation changes most commonly cause database crashes during index scans, but also uncommon issues like duplicate primary keys, data landing in wrong partitions, inconsistent merge join results, etc.\nFortunately, PG 17 provides a very safe locale provider: builtin, no longer dependent on OS-provided glibc, ICU, etc. Example:\ninitdb --locale-provider=builtin --builtin-locale=C.UTF-8 dbname1 However,\nbuiltin is great but arrived too late. Converting existing production instances to builtin collation is no small task. Moreover, Xinchuang migrations or OS upgrades may not mandate database upgrades.\nDuring Xinchuang migration, the target host\u0026rsquo;s glibc version is typically higher than the old Intel server\u0026rsquo;s — likely crossing version 2.28. 
Combined with tight deadlines, KPI pressure, staffing shortages, and large databases, physical migration is unavoidable. So physical Xinchuang migration must account for glibc version and collation-induced anomalies.\nWhat can you do after physical migration?\nI. Official required steps\nCheck indexes, rebuild those that are clearly problematic ALTER DATABASE name REFRESH COLLATION VERSION Check dependent objects ALTER COLLATION name REFRESH VERSION II. Unofficial \u0026ldquo;dark arts\u0026rdquo; approaches\nI don\u0026rsquo;t have a complete solution, just ideas:\nHandle partitioned table data landing in wrong partitions\nPartition key is int/bigint/float: unrelated to collation, don\u0026rsquo;t worry Partition key is timestamp: don\u0026rsquo;t worry; if varchar or other character types: evaluate Partition key is character type: refer to \u0026ldquo;a\u0026rdquo; vs \u0026ldquo;-\u0026rdquo; sort order (pgconf Collation Challenges Sorting It Out). But note: If querying data, don\u0026rsquo;t query from the parent table — may crash or return nothing No simple detection method Handle primary key / unique key conflicts\nHandle FDW sort range anomalies\nUnknown issues\nReference: collation\nSmooth Major Version Upgrades # https://gitlab.com/postgres-ai/postgresql-consulting/postgres-howtos/-/blob/main/0077_zero_downtime_major_upgrade.md?ref_type=heads\nhttps://www.postgresql.eu/events/pgconfeu2023/sessions/session/4791/slides/439/2023.pgconf.eu%20Zero%20Downtime%20PostgreSQL%20Upgrades.pdf\nCommon major version upgrade approaches:\npg_upgrade in-place upgrade. Not recommended — may blow up in place. pg_dump: suitable for small databases, longer maintenance windows. Logical sync + switchover (pub/sub, pglogical, DTS, etc.): suitable for small databases, shorter windows. Physical forward sync + logical reverse sync: suitable for large databases, not-too-short windows. Physical replication full sync + logical incremental sync + switchover: suitable for large databases, extremely short windows. 
Syncing full data via logical replication can be extremely slow. In-place upgrade of a new standby carries uncertainty and upgrade time, plus the need for reverse logical sync. \u0026ldquo;Smooth major version upgrade\u0026rdquo; is essentially \u0026ldquo;physical replication full sync + logical incremental sync + switchover.\u0026rdquo;\nKey technique: the primary creates a slot and returns an LSN. The new standby uses recovery_target_lsn to recover to that LSN, then logical sync begins.\nApproximate workflow:\nPre-checks. Multi-database (consider applying one slot LSN for all), extensions, pathman, triggers, foreign keys, unlogged tables, crontab, etc. Physical sync. Old and new version software, compare and backup conf files, pg_basebackup to build new standby on old version. Logical sync prep 1. Primary keys and replica identity, create publication; prohibit application DDL/DCL. Restore new standby to target LSN. Stop new standby; create slot on old primary and record LSN; start new standby with target LSN. New standby major version upgrade. Upgrade, handle issues, switch environment variables. Logical sync prep 2. Disable triggers, foreign keys, jobs, extensions, etc. Logical sync. Create subscription with specified slot, copy_data=false. Post logical sync. Check for index corruption, check logs for errors and fix, rebuild remote standbys. Switchover. Stop application; advance sequences, enable foreign keys, triggers, jobs, etc. Switchover. Build reverse link (old primary subscribes). Switchover. Application cutover. The smooth major upgrade approach is smooth for the business but complex for the DBA. It combines all the drawbacks of logical and physical migration — quite painful to execute. The steps above are already simplified. 
This approach consumes DBA manpower; consider it only for the most critical databases.\nPartitioned Table Management # PostgreSQL partitioned tables are very flexible, lack built-in interval partitioning, and have varied behavior across versions — making partition management problems an annual occurrence. I believe many PG DBAs still worry about new partition issues.\nMy observations on partition management and usage issues:\nNot using declarative partitioning. Older versions still use pathman partitioning or inheritance-based partitioning, or continue using them even after upgrading. Declarative partitioning was introduced in PG 10. Due to early version limitations, recommend only using declarative partitioning from at least PG 12 onward to reduce environmental complexity. Developers building child table indexes/primary keys directly. Creating indexes/PKs directly on child tables via SQL rather than through parent table inheritance means the next developer writing SQL may forget. This leads not only to parent-child inconsistency but also child-child inconsistency, eventually making the partition structure unrecognizable. No new partition management strategy. Forgetting to create new partitions or using a DEFAULT partition. Typically, developers create partitions for a few years ahead; next time, the developers may have moved on, and no one manages new partition creation. This is a ticking time bomb, or data lands in the DEFAULT partition, defeating the purpose of partitioning. Lack of DBA management. Yes, DBA! PG partitioned table knowledge is extensive (see PostgreSQL Partitioned Tables). How to build management strategies and implement them in your environment requires proactive DBA involvement. This may be the most important factor. 
My partition management goals (from Case Study: 2026-01-01 Partition Data Update Failure):\nUse the parent table structure as the canonical structure — the parent table faces developers; it should have primary keys, indexes, and replica identity (unless the PG version doesn\u0026rsquo;t support it). Keep parent and child tables consistent. Use PARTITION OF when creating new partitions (yes, I don\u0026rsquo;t recommend ATTACH). Keep child tables consistent with each other. Create new partitions in advance. Partition data volume should not be too large. DEFAULT partitions are not recommended. If created, must monitor writes to them. Queries on frequently accessed tables must include the partition key for partition pruning. Otherwise, convert to a regular table. Observability # The official documentation clearly explains database, table, index, SQL, flush, and other metrics.\nA few metrics deserve special attention — not only are they unclearly explained, but they\u0026rsquo;re frequently used and have a learning curve.\nbuffers_alloc, blks_read # pg_stat_bgwriter.buffers_alloc: Number of buffers allocated — shared memory eviction volume. pg_stat_database.blks_read: OS cache reads. (buffers_alloc may appear in different views across PG versions, but the meaning is the same.)\npg_stat_bgwriter.buffers_alloc is the shared memory buffer allocation count (called buffer allocation in the source). It represents shared memory eviction volume — newly started databases typically have higher values. When observing shared memory busyness, buffer allocation may be better than hit ratio — high hit ratios can be inflated by frequent small-table access, while allocation represents actual eviction.\nbuffers_alloc counts buffers allocated after reading from cache and loading into a new shared buffer — somewhat representative of OS cache reads too? But in practice, buffers_alloc and blks_read have similar meanings yet can differ significantly in value. Why? 
Unclear, pending research.\nSource: numBufferAllocs\ntup_fetched, tup_returned # These are metrics in pg_stat_database:\ntup_fetched: Number of rows ultimately returned from index scans, after removing filtered rows, dead tuples, and invisible rows. Result-oriented. tup_returned: Number of rows fetched from the table during index scans, regardless of filter conditions, dead tuples, or visibility. Process-oriented. Thus, tup_returned is typically much higher than tup_fetched. An abnormally high tup_returned suggests optimization opportunity — after all, many rows were accessed but few returned to the client.\nidx_tup_fetch, idx_tup_read # These are metrics in pg_stat_all_indexes:\nidx_tup_read: Number of index entries accessed (counted from the index side), includes bitmap scans. idx_tup_fetch: Number of rows ultimately returned from index scans (counted from the table side), excludes bitmap scans. Madness.\nOne thing to remember: xx_tup_fetch refers to the final rows returned after index access + table fetch — result-oriented.\nReferences # postgres-ai howtos\nBest practices for using pgvector\nCase Study: 2026-01-01 Partition Data Update Failure\nPostgreSQL Partitioned Tables\nCase Study: From Inaccurate DISTINCT to DISTINCT Calculation Principles\nCase Study: Adding an Index Causes Performance Degradation and Generic Plans\nFrom Static Table Query Conflicts to Their Principles\nControl File Parameters and Primary-Standby Parameter 
Mismatch\nhttps://liuzhilong.blog.csdn.net/article/details/130783036\nhttps://techcommunity.microsoft.com/blog/adforpostgresql/improving-postgres-connection-scalability-snapshots/1806462\nhttps://www.postgresql.org/docs/17/sql-prepare.html\nhttps://www.postgresql.org/docs/17/sql-deallocate.html\nhttps://www.postgresql.org/docs/release/13.0/\nhttps://jdbc.postgresql.org/documentation/use/\nhttps://jdbc.postgresql.org/documentation/server-prepare/#server-prepared-statements\nhttps://www.postgresql.eu/events/pgconfeu2023/sessions/session/4791/slides/439/2023.pgconf.eu%20Zero%20Downtime%20PostgreSQL%20Upgrades.pdf\nThanks to Master Gao for the 2025 battles.\n","date":"Jan 11, 2026","externalUrl":null,"permalink":"/en/2026/01/11/postgresql-operations-experience-2025/","section":"Posts","summary":"This is a technical operations summary, focused on being accessible and practical. It also serves as a periodic reflection on PostgreSQL database operations. Hope it helps fellow PGers.\nPrevious ops experience: PostgreSQL Operations Experience 2024. Note: this article does not repeat content from that one.\nCPU # SQL performance problems are the most common root cause in PostgreSQL incident handling. This includes poor SQL performance, suboptimal indexing, sudden high concurrency, and execution plan regressions. For a database like PostgreSQL that lacks a robust plan-binding mechanism, having a DBA team to help design data models, access patterns, indexes, and tune execution plans is crucial — it can significantly reduce sudden CPU saturation incidents.\n","title":"PostgreSQL Operations Experience 2025","type":"posts"},{"content":" Symptoms # On December 30, business errors were reported — data could not be updated:\nERROR: 55000: cannot update table \u0026#34;tablzl_202601\u0026#34; because it does not have a replica identity and publishes updates HINT: To enable updating the table, set REPLICA IDENTITY using ALTER TABLE. 
LOCATION: CheckCmdReplicaIdentity, execReplication.c:575 Temporary Recovery # The error message was clear: no replica identity. The table was a partitioned table and a 2026 partition, so I immediately suspected the new partition lacked a primary key. (A new table\u0026rsquo;s replica identity defaults to default, which only uses a primary key as the replica identity. Without a primary key, updates are impossible.)\nFurther investigation revealed: the parent table had no primary key or indexes, child partitions from 2025 and earlier had both primary keys and indexes, but 2026 and later child partitions had neither — and all child partitions were published. Roughly:\np_parent -- no PK, no indexes p_child_202511 -- has PK, has indexes, published p_child_202512 -- has PK, has indexes, published p_child_202601 -- no PK, no indexes, published p_child_202602 -- no PK, no indexes, published Since the parent table had nothing, a partition of child would also have nothing — you must manually create the primary key and indexes for each child partition. So the new partition creation was problematic; the old partitions presumably had them added after creation.\nAdditionally, publishing partitioned tables via the parent was only supported starting from PG13. Previously, you couldn\u0026rsquo;t publish via the parent — only via child tables. This database was on PG11.\nAllow partitioned tables to be logically replicated via publications (Amit Langote) § §\nPreviously, partitions had to be replicated individually. Now a partitioned table can be published explicitly, causing all its partitions to be published automatically. Addition/removal of a partition causes it to be likewise added to or removed from the publication. 
The CREATE PUBLICATION option publish_via_partition_root controls whether changes to partitions are published as their own changes or their parent\u0026rsquo;s.\nAfter the initial diagnosis and given the urgency, there were three ways to temporarily resolve:\nAdd primary keys to the 2026 partitions Set replica identity full on the 2026 partitions Remove the 2026 partitions from the publication Since recovery time was about the same for all options, we chose adding primary keys — the lowest operational cost — to at least stop the business errors.\nRoot Cause Analysis # The issue seems clear: \u0026ldquo;no replica identity + published + no primary key\u0026rdquo; prevents updates. But several questions still needed answers.\nQuestion 1: Why does the UPDATE fail even though there\u0026rsquo;s no 202601 data at all (the new partition has zero rows)? # The SQL text was:\nUPDATE tablzl_202601 SET idid = $1,... date_updated = now() WHERE mykey = $4 The partition key for tablzl_202601 is created_date. The SQL WHERE clause didn\u0026rsquo;t include the partition key, so when attempting to update the 202601 partition, it found no primary key and errored out.\nAs for whether row existence or replica identity is checked first, we can see from ExecSimpleRelationUpdate. This function has changed very little across PG versions:\n/* * Find the searchslot tuple and update it with data in the slot, * update the indexes, and execute any constraints and per-row triggers. * * Caller is responsible for opening the indexes. */ void ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate, TupleTableSlot *searchslot, TupleTableSlot *slot) { ... 
CheckCmdReplicaIdentity(rel, CMD_UPDATE); // check replica identity /* BEFORE ROW UPDATE Triggers */ if (resultRelInfo-\u0026gt;ri_TrigDesc \u0026amp;\u0026amp; resultRelInfo-\u0026gt;ri_TrigDesc-\u0026gt;trig_update_before_row) { slot = ExecBRUpdateTriggers(estate, epqstate, resultRelInfo, \u0026amp;searchslot-\u0026gt;tts_tuple-\u0026gt;t_self, NULL, slot); if (slot == NULL)\t/* \u0026#34;do nothing\u0026#34; */ skip_tuple = true; } if (!skip_tuple) { List\t*recheckIndexes = NIL; /* Check the constraints of the tuple */ if (rel-\u0026gt;rd_att-\u0026gt;constr) ExecConstraints(resultRelInfo, slot, estate); if (resultRelInfo-\u0026gt;ri_PartitionCheck) ExecPartitionCheck(resultRelInfo, slot, estate, true); /* Materialize slot into a tuple that we can scribble upon. */ tuple = ExecMaterializeSlot(slot); /* OK, update the tuple and index entries for it */ simple_heap_update(rel, \u0026amp;searchslot-\u0026gt;tts_tuple-\u0026gt;t_self, slot-\u0026gt;tts_tuple); if (resultRelInfo-\u0026gt;ri_NumIndices \u0026gt; 0 \u0026amp;\u0026amp; !HeapTupleIsHeapOnly(slot-\u0026gt;tts_tuple)) recheckIndexes = ExecInsertIndexTuples(slot, \u0026amp;(tuple-\u0026gt;t_self), estate, false, NULL, NIL); /* AFTER ROW UPDATE Triggers */ ExecARUpdateTriggers(estate, resultRelInfo, \u0026amp;searchslot-\u0026gt;tts_tuple-\u0026gt;t_self, NULL, tuple, recheckIndexes, NULL); list_free(recheckIndexes); } } ExecSimpleRelationUpdate flow:\nCheck replica identity BEFORE ROW UPDATE triggers Check constraints (both non-partition and partition constraints) Update the row Insert index entries AFTER ROW UPDATE triggers So PG\u0026rsquo;s logic checks replica identity first, before row updates and everything else.\nEven though the SQL didn\u0026rsquo;t include the partition key, would adding it trigger partition pruning? The answer is: maybe not.\nPartition pruning improvements across versions:\nPG10 introduced declarative partitioning. 
There was no enable_partition_pruning parameter; pruning was done at planning time via constraint_exclusion. So PG10 had no query-execution-time pruning. PG11 added runtime partition pruning: Allow partition elimination during query execution (David Rowley, Beena Emerson). But it only supports pruning with bound variables, not non-immutable functions (including now()). PG14 added final pruning: This wins in UPDATEs on partitioned tables when only some of the partitions will actually receive updates. i.e., supports pruning with non-immutable functions. Since PG11 doesn\u0026rsquo;t support now() pruning, adding a now() condition to the business SQL wouldn\u0026rsquo;t trigger pruning — the error would still occur. However, if the business passed a bound variable, pruning would trigger and the error wouldn\u0026rsquo;t appear. Note: \u0026ldquo;the error wouldn\u0026rsquo;t appear\u0026rdquo; means updating 202512 data wouldn\u0026rsquo;t error out on the 202601 partition; updating 202601 data would still fail regardless.\nQuestion 2: The partition was created on 2025-12-26, so why was the problem only discovered on December 30? # This is even simpler: \u0026ldquo;no replica identity + published + no primary key\u0026rdquo; is an AND condition.\nAlthough the new partitions were created early, they were published on the evening of December 29 at 20:47:\ncat postgresql-12-29.csv.bak |grep \u0026#34;alter publication\u0026#34; 2025-12-29 20:48:07.730 CST,\u0026#34;userlzlreplication\u0026#34;,\u0026#34;lzldb\u0026#34;,xxx\u0026#34;statement: alter publication publzl add table \u0026#34;\u0026#34;public\u0026#34;\u0026#34;.\u0026#34;\u0026#34;tablzl_202601\u0026#34;\u0026#34;, \u0026#34;\u0026#34;public\u0026#34;\u0026#34;.\u0026#34;\u0026#34;tablzl_202602\u0026#34;\u0026#34;,... 
The first error appeared on December 29 at 22:26, about 1.5 hours later:\ncat postgresql-12-29.csv.bak |grep \u0026#34;REPLICA IDENTITY\u0026#34; 2025-12-29 22:26:01.404 CST,\u0026#34;userlzlreplication\u0026#34;,\u0026#34;lzldb\u0026#34;,375121,xxx,\u0026#34;cannot update table \u0026#34;\u0026#34;tablzl_202601\u0026#34;\u0026#34; because it does not have a replica identity and publishes updates\u0026#34;,,\u0026#34;To enable updating the table, set REPLICA IDENTITY using ALTER TABLE.\u0026#34;,,,,\u0026#34;UPDATE tablzl Summary # Root cause overview: The parent table had no primary key, so partition of child partitions naturally also had none. Old child partitions had their primary keys added manually; new child partitions did not, resulting in the 202601 partition lacking a primary key. Logical replication relies on the primary key (default replica identity) for synchronization. Without replica identity, changes can\u0026rsquo;t be sent downstream, and UPDATE/DELETE statements on published tables cannot execute. In PG11, an UPDATE SQL that does include the partition key condition may still visit the new partition.\nA stroke of luck: Due to various factors, this problem was discovered early in this particular database. We had a one-day buffer on December 31 to fix all database instances, ensuring at least that January 1 new partition data updates wouldn\u0026rsquo;t error out. Otherwise, on January 1, 2026, multiple systems would have likely gone up in flames.\nTemporary measures (pick one):\nAdd primary keys to 2026 partitions Set replica identity full on 2026 partitions Remove 2026 partitions from the publication For replication pipeline optimization:\nTables without primary keys should be detected proactively, otherwise publishing them could cause business-side UPDATE failures For partition management strategy:\nPG\u0026rsquo;s partitioned tables are highly flexible, and developers generally don\u0026rsquo;t know how to create partitions correctly. 
Combined with significant new partitioning features across roughly PG10-15, and the lack of INTERVAL partitioning in PG, partitioned tables can end up a mess. Standardized management of partitioned tables is thus critical. For partition table features and operational tips, see: PostgreSQL Partitioned Tables\nAs for management tools, I\u0026rsquo;ll skip those.\nManagement goals:\nUse the parent table structure as the standard: the parent table, being developer-facing, should have primary keys, indexes, and replica identity (unless the PG version doesn\u0026rsquo;t support it) Keep parent and child tables consistent; use partition of to create new partitions (yes, I don\u0026rsquo;t recommend attach) Keep child tables consistent with each other Create new partitions in advance; partition data volumes should not be excessive Default partitions are not recommended; if created, their writes must be monitored Frequently accessed tables must have partition keys in their SQL queries and use partition pruning; otherwise, convert them to regular tables References # https://www.postgresql.org/docs/release/10.0/\nhttps://www.postgresql.org/docs/release/11.0/\nhttps://www.postgresql.org/docs/release/12.0/\nhttps://www.postgresql.org/docs/release/13.0/\nhttps://www.postgresql.org/docs/release/14.0/\nsrc/backend/executor/execReplication.c\nPostgreSQL Partitioned Tables\n","date":"Jan 4, 2026","externalUrl":null,"permalink":"/en/2026/01/04/case-partition-data-update-failure-on-2026-01-01/","section":"Posts","summary":"Symptoms # On December 30, business errors were reported — data could not be updated:\nERROR: 55000: cannot update table \"tablzl_202601\" because it does not have a replica identity and publishes updates HINT: To enable updating the table, set REPLICA IDENTITY using ALTER TABLE. LOCATION: CheckCmdReplicaIdentity, execReplication.c:575 Temporary Recovery # The error message was clear: no replica identity. 
The table was a partitioned table and a 2026 partition, so I immediately suspected the new partition lacked a primary key. (A new table’s replica identity defaults to default, which only uses a primary key as the replica identity. Without a primary key, updates are impossible.)\n","title":"Case: Partition Data UPDATE Failure on 2026-01-01","type":"posts"},{"content":"Paper: Anarchy in the Database: A Survey and Evaluation of Database Management System Extensibility\nGitHub: https://github.com/cmu-db/ext-analyzer\nPGConf: The trouble with extensions (PGConf.dev 2025)\nWhy This Paper # This is a survey of database extensions (mainly Postgres), covering the implementation approaches of extensions across different databases, existing problems, and most importantly, compatibility. The most significant finding: an evaluation of over 400 PostgreSQL extensions shows that 16.8% of extensions have compatibility issues with at least one other extension, potentially leading to system failures.\nAnalysis tools and results are on GitHub; Marco Slot\u0026rsquo;s presentation is at PGConf.\nExtension Categories # Extension Classification # The extension classification chapter is particularly lengthy — a single diagram actually clarifies everything.\nExtensions across 6 databases:\nPostgreSQL (1986): Written in C, designed from the beginning as an extensible architecture. Consequently, PostgreSQL has the richest and most diverse extension ecosystem. MySQL (1994): Written in C++, best known for its storage engine plugin architecture. MariaDB (2009): A fork of MySQL, also C++ based, supporting more extensions than the original MySQL. SQLite (2000): Embedded database written in C, adaptable to various hardware devices and operating systems. Redis (2009): In-memory key-value store written in C, uniquely extensible — only supports running above the DBMS key-value storage layer. DuckDB (2018): Embedded analytical database written in C++, with a rapidly emerging extension ecosystem. 
Flexibility and Security # Extension security and flexibility are a trade-off — PG extensions are the most flexible but least secure; Redis is the most secure but least flexible:\nHow PostgreSQL Extensions Are Typically Implemented # PG generally has two ways to implement extensions:\nThrough handler functions, such as UDFs, UDTs, external tables, storage engines, and index access methods. Through hooks. Hooks are declared as function pointers in global variables; if a hook is set, it will call these pointers instead of its own code. Implementations may use both approaches — they\u0026rsquo;re not mutually exclusive. The other 5 databases have generally similar implementations, but none of them have hook-based implementations.\nExtensions may use different implementation approaches, e.g., function + types + index AM — this is the number of extensibility types. From Figure 1, we can see that extensions with 1-3 types are the most common, and the most-used implementation approach is function.\nFrom Table 3, 92.5% of extensions use UDFs — after all, it\u0026rsquo;s a user-facing feature, easiest to develop with the lowest barrier to entry. The least used is client authentication, as this scenario itself is uncommon.\nExtension Code Copy Rate # The paper also conducted an interesting survey: the extent to which extension code is copied from built-in code:\nOut of 441 extensions, 16.6% — 73 extensions — contain at least one line copied from PG source code. The detailed distribution is shown in the left chart above.\nWhy are so many extensions copying code? Because:\nSome functions in PG source are declared static, only callable within their own file, so they can only be copied. Due to the extension\u0026rsquo;s own requirements, functions may need slight adjustments, so they can only be copied and adjusted. And how much were these copied functions adjusted? 
See the right chart above.\nAs can be seen, unmodified copies are actually rare.\nIn summary, extension code is copied from PG source out of necessity, and the overall copy rate isn\u0026rsquo;t high.\nThe Heavyweight! — PG Extension Compatibility # This is the most interesting part of the paper: pairwise compatibility testing was conducted on 96 extensions, and testing found that 16.8% of extension pairs are incompatible!\nTesting methodology:\nInstallation. Yes, installation alone can cause problems. The authors tested both A→B and B→A installation orders, hence the asymmetric diagram. Running the extension\u0026rsquo;s provided unit tests. pgbench. Smoke testing. pgbench is of course simple, but good results here can still indicate something. Among the top 20 least compatible extensions, many commonly-used ones appear:\nCommon extensions: pg_hint_plan, vector, pg_show_plans, pgsentinel, pg_cron, pg_stat_kcache Heavy extensions: citus, timescaledb The fact that such extremely common and star extensions can have such poor compatibility is jaw-dropping.\nWhat\u0026rsquo;s even more chilling: this is just simple pairwise testing. Running 3-10 extensions should be the production norm, and production environments are far more complex and variable than the paper\u0026rsquo;s three testing methods.\nFinally, the paper identifies the reason for poor extension compatibility: extensions that use more components, extension types, and hooks are more likely to be incompatible with other extensions.\nNitpicking # It\u0026rsquo;s really still about Postgres The paper\u0026rsquo;s title says DBMS, but it\u0026rsquo;s mainly about PG compatibility. MySQL, Redis, etc. compatibility is only covered in the survey, with no experimental data at all. 
(Though the survey is interesting — you can learn how MySQL and Redis extensions are implemented.)\nOn the other hand, this paper has a kind of alternative \u0026ldquo;general-specific-general\u0026rdquo; feel: \u0026ldquo;DBMS-Postgres-DBMS\u0026rdquo; \u0026#x1f605;\nInsufficient compatibility testing PG has 400+ extensions, but only 96 were tested for compatibility, and only 1-on-1 compatibility testing, without tests involving 3 or more extensions. The compatibility testing isn\u0026rsquo;t particularly comprehensive.\nConclusion # PG extensions are indeed numerous and flexible — you\u0026rsquo;d struggle to find functionality that PG extensions don\u0026rsquo;t support. But the extensions themselves are almost in a state of \u0026ldquo;anarchy\u0026rdquo; — both extension development and usage have problems.\nFrom the compatibility results, extension compatibility is quite poor — even the installation order affects compatibility. Multiple extensions also depend on hook execution order; for example, two extensions both requiring themselves to execute last becomes awkward. \u0026ldquo;Having everything\u0026rdquo; doesn\u0026rsquo;t mean \u0026ldquo;install everything.\u0026rdquo;\nExtension Security Issues # PG extensions have virtually no security management, whether from inherently unsafe extensions or user privilege escalation through extensions.\nIf an extension contains unsafe languages, only the OS can restrict its behavior, not the DBMS.\nIf an extension can access user space, the OS layer cannot manage it.\nExtensions implemented through queries (e.g., UDFs) generally won\u0026rsquo;t bypass ACL policies. While UDFs are more secure, they\u0026rsquo;re not absolutely secure, as UDFs with admin privileges can exist.\nA single hook may not be restricted by ACL, because in PostgreSQL, ACL is only enforced at the planning and execution layers. 
PG provides SECURITY LABEL to restrict access control for objects (including extensions).\nPhilosophical Thoughts on Software Management # \u0026ldquo;If an extension contains unsafe languages, only the OS can restrict its behavior, not the DBMS.\u0026rdquo;\nThis statement itself isn\u0026rsquo;t wrong, but it carries an implication of \u0026ldquo;your directory could be deleted.\u0026rdquo; To counter this, consider the following:\nIf you use this software, you trust it, just like PG itself (but even when using PG, you create a postgres OS user rather than using root directly). As for extensions, treat them as part of the PG software. PG is trusted and can be installed directly in production because of its industry reputation. The same goes for extensions — choose reputable extensions rather than using them indiscriminately. This is essentially the difference between PostgreSQL community gatekeeping and extension provider gatekeeping. For cloud service providers, many extensions aren\u0026rsquo;t supported — the cloud provider assumes the gatekeeping function and the responsibility of taking the blame.\nVersion Convergence # PG extension versions have these characteristics:\nThe same extension may have different extension packages for different database versions. Extensions have different versions. This means that without version management, you\u0026rsquo;ll end up with unmanageable numbers of software versions. To address this, limiting specific PG versions to installing specific extension versions is a good approach. As for extension upgrades needed for certain requirements, implement them through PG version upgrades. This strategy sacrifices some flexibility to ensure stability. 
I personally think it\u0026rsquo;s worthwhile — the need to upgrade extensions itself isn\u0026rsquo;t common, but it can reduce many software management issues and unknown compatibility problems.\nConsider Compatibility When Using Extensions # Since extension compatibility isn\u0026rsquo;t great, managing extensions becomes especially important — we don\u0026rsquo;t want the database returning strange results or even crashing while running.\nExtension management strategy: 1. Install necessary extensions. 2. Create needed extensions on demand. 3. Don\u0026rsquo;t install obscure extensions. Search the compatibility matrix. While PG compatibility testing isn\u0026rsquo;t perfect, it\u0026rsquo;s still valuable. Since the paper isn\u0026rsquo;t directly searchable for the compatibility matrix, you can \u0026ldquo;ctrl+f\u0026rdquo; search the ext-analyzer compatibility table to preliminarily assess whether extensions you need have good compatibility. Trivia # In the 1976 INGRES paper, UDFs were already implemented through extensions. Even POSTGRES carried forward this functionality in its 1986 initial release. Oracle\u0026rsquo;s UDF implementation came in Oracle 7, released in 1992 — much later than PG.\nThe SQL standard didn\u0026rsquo;t include UDFs until 1996 — a full 20 years after INGRES\u0026rsquo;s UDF. 
Stonebraker indeed wasn\u0026rsquo;t very focused on driving standards.\nOriginal link: https://lastdba.com/2026/01/03/论文精读插件无政府状态/\n","date":"Jan 3, 2026","externalUrl":null,"permalink":"/en/2026/01/03/paper-deep-read-anarchy-in-the-database/","section":"Posts","summary":"Paper: Anarchy in the Database: A Survey and Evaluation of Database Management System Extensibility\nGitHub: https://github.com/cmu-db/ext-analyzer\nPGConf: The trouble with extensions (PGConf.dev 2025)\nWhy This Paper # This is a survey of database extensions (mainly Postgres), covering the implementation approaches of extensions across different databases, existing problems, and most importantly, compatibility. The most significant finding: an evaluation of over 400 PostgreSQL extensions shows that 16.8% of extensions have compatibility issues with at least one other extension, potentially leading to system failures.\n","title":"Paper Deep Read: Anarchy in the Database","type":"posts"},{"content":"","date":"Jan 3, 2026","externalUrl":null,"permalink":"/en/categories/%E8%AE%BA%E6%96%87%E8%A7%A3%E8%AF%BB/","section":"Categories","summary":"","title":"论文解读","type":"categories"},{"content":" Symptoms # The database showed a large number of row locks and a smaller number of LWLock LockManager waits. CPU was maxed out and active sessions spiked. The blocking PID associated with the locks kept changing, with no obvious long-transaction blocker. (Imagine high CPU and active sessions.)\nThe SQL corresponding to the large number of locks was as follows:\nUPDATE lzl_record SET rc_lzl1= rc_lzl1 + $1, pc_lzl2 = pc_lzl2 + $2, rc_lzl3 = rc_lzl3 + $3 where lzl_id = $4 Analysis # No Increase in SQL Concurrency Observed # From the correlation between hits and CPU, we can analyze from the SQL hit perspective. That UPDATE SQL accounted for about 80% of activity. 
The SQL\u0026rsquo;s execution count had not changed, but blks hit was clearly abnormal.\nWe also analyzed metadata access: within snapshots, no metadata tables showed unusually high access.\nFrom the symptom analysis, neither SQL concurrency increase nor metadata anomalies were apparent. The reason for the SQL hit increase wasn\u0026rsquo;t obvious at this point.\nLWLock LockManager Analysis # The SQL itself is simple: lzl_id is a unique field in the lzl_record table, so each update locates its row by unique key.\nIn addition to the large number of explicit locks, the wait events at the scene also included LWLock LockManager.\nHowever, the table is a regular table (not partitioned), with only 4 or 5 indexes on it.\nLWLock LockManager is related to not using the fast path. Simple queries and DML can use the fast path:\nWeak relation locks. SELECT, INSERT, UPDATE, and DELETE must acquire a lock on every relation they operate on, as well as various system catalogs that can be used internally. Many DML operations can proceed in parallel against the same table at the same time; only DDL operations such as CLUSTER, ALTER TABLE, or DROP \u0026ndash; or explicit user action such as LOCK TABLE \u0026ndash; will create lock conflicts with the \u0026ldquo;weak\u0026rdquo; locks (AccessShareLock, RowShareLock, RowExclusiveLock) acquired by DML operations.\nSo a SELECT/DML accessing no more than 16 relations (including indexes) should be able to use the fast path, and there shouldn\u0026rsquo;t be much LWLock LockManager.\nHowever, a DML statement can\u0026rsquo;t rely on the fast path alone. The fast path records a relation lock in the backend\u0026rsquo;s own slots without touching the shared lock hash table, but an UPDATE must also check whether other sessions hold locks on the row, and that check goes through the shared lock table. 
Combined with the fact that this SQL updates by unique field yet still encounters row locks, it must be updating the same row.\nFrom the logs, we could see instances of updating the same row — one row had tens of thousands of lock-waiting updates.\nBenchmark Testing # Benchmarking Same-Row Updates to Reproduce LWLock LockManager # Given that row locks definitely can\u0026rsquo;t rely solely on the fast path, and knowing that LWLock LockManager degrades database performance, we benchmarked different scenarios.\n#prompt Give me a pgbench benchmark script Table structure: primary key, unique field + unique index, other fields Update: update by unique field Benchmark repeated updates on the same row (repeated row-lock updates) Benchmark random updates on different rows (no row-lock updates) Script omitted. Environment: 20 cores, 96GB RAM.\npgbench commands:\npgbench -h localhost -p $PGPORT -d lzldb -U dbmgr -f update_same_unique_key.sql -c 200 -j 32 -T 600 -r -S pgbench -h localhost -p $PGPORT -d lzldb -U dbmgr -f update_random_unique_key.sql -c 200 -j 32 -T 600 -r -S Wait events during the benchmark:\n-- Update same row, 2 typical samples usename | state | wait_event | wait_event_type | cnt ----------+--------+---------------------+-----------------+----- dbmgr | active | LockManager | LWLock | 105 dbmgr | active | transactionid | Lock | 61 dbmgr | active | tuple | Lock | 25 dbmgr | active | [null] | [null] | 8 dbmgr | active | WALSync | IO | 1 usename | state | wait_event | wait_event_type | cnt ----------+--------+---------------------+-----------------+----- dbmgr | active | transactionid | Lock | 180 dbmgr | active | LockManager | LWLock | 18 dbmgr | active | tuple | Lock | 1 dbmgr | active | WALSync | IO | 1 -- Update different rows, 2 typical samples usename | state | wait_event | wait_event_type | cnt ----------+---------------------+---------------------+-----------------+----- dbmgr | active | [null] | [null] | 106 dbmgr | idle | ClientRead | Client | 34 
dbmgr | idle in transaction | ClientRead | Client | 25 dbmgr | active | WALWrite | LWLock | 21 dbmgr | active | BufferMapping | LWLock | 7 dbmgr | idle in transaction | [null] | [null] | 4 dbmgr | idle in transaction | WALWrite | LWLock | 2 usename | state | wait_event | wait_event_type | cnt ----------+---------------------+---------------------+-----------------+----- dbmgr | active | [null] | [null] | 117 dbmgr | idle | ClientRead | Client | 42 dbmgr | idle in transaction | ClientRead | Client | 24 dbmgr | active | WALWrite | LWLock | 12 dbmgr | active | XactGroupUpdate | IPC | 1 dbmgr | active | WALSync | IO | 1 dbmgr | active | XactSLRU | LWLock | 1 dbmgr | active | BufferContent | LWLock | 1 dbmgr | active | ClientRead | Client | 1 From the wait events, the difference is clear: updating the same row produces LWLock LockManager, sometimes at a high proportion. Updating different rows mostly just waits on CPU. Scenario 1 matches the production situation.\nA Brief Analysis of Row Locks and Fast Path # The lmgr README\u0026rsquo;s explanation of the fast path:\nFast Path Locking ----------------- Fast path locking is a special purpose mechanism designed to reduce the overhead of taking and releasing certain types of locks which are taken and released very frequently but rarely conflict. Currently, this includes two categories of locks: (1) Weak relation locks. SELECT, INSERT, UPDATE, and DELETE must acquire a lock on every relation they operate on, as well as various system catalogs that can be used internally. Many DML operations can proceed in parallel against the same table at the same time; only DDL operations such as CLUSTER, ALTER TABLE, or DROP -- or explicit user action such as LOCK TABLE -- will create lock conflicts with the \u0026#34;weak\u0026#34; locks (AccessShareLock, RowShareLock, RowExclusiveLock) acquired by DML operations. 
Conditions for locks that can use the fast path, from lmgr/lock.c:\n/* * The fast-path lock mechanism is concerned only with relation locks on * unshared relations by backends bound to a database. The fast-path * mechanism exists mostly to accelerate acquisition and release of locks * that rarely conflict. Because ShareUpdateExclusiveLock is * self-conflicting, it can\u0026#39;t use the fast-path mechanism; but it also does * not conflict with any of the locks that do, so we can ignore it completely. */ #define EligibleForRelationFastPath(locktag, mode) \\ ((locktag)-\u0026gt;locktag_lockmethodid == DEFAULT_LOCKMETHOD \u0026amp;\u0026amp; \\ (locktag)-\u0026gt;locktag_type == LOCKTAG_RELATION \u0026amp;\u0026amp; \\ (locktag)-\u0026gt;locktag_field1 == MyDatabaseId \u0026amp;\u0026amp; \\ MyDatabaseId != InvalidOid \u0026amp;\u0026amp; \\ (mode) \u0026lt; ShareUpdateExclusiveLock) SELECT/DML can use the fast path, but only for locktype=relation.\nLet\u0026rsquo;s look at the actual lock situation when there\u0026rsquo;s a row lock:\n-- Session 1 begin; update lzl1 set b=\u0026#39;zzz\u0026#39; where a=1; -- Session 2 begin; update lzl1 set b=\u0026#39;zzz\u0026#39; where a=1; -- waiting -- Session 3 select * from pg_locks where pid\u0026lt;\u0026gt;(select pg_backend_pid()) order by pid,locktype; locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted | fastpath ---------------+----------+----------+--------+--------+------------+---------------+---------+--------+----------+--------------------+--------+------------------+---------+---------- relation | 4267681 | 5290151 | [null] | [null] | [null] | [null] | [null] | [null] | [null] | 5/4791 | 220559 | RowExclusiveLock | t | t transactionid | [null] | [null] | [null] | [null] | [null] | 170706189 | [null] | [null] | [null] | 5/4791 | 220559 | ExclusiveLock | t | f transactionid | [null] | [null] | [null] | [null] | [null] | 
170706190 | [null] | [null] | [null] | 5/4791 | 220559 | ExclusiveLock | t | f transactionid | [null] | [null] | [null] | [null] | [null] | 170706187 | [null] | [null] | [null] | 5/4791 | 220559 | ShareLock | f | f tuple | 4267681 | 5290151 | 0 | 1 | [null] | [null] | [null] | [null] | [null] | 5/4791 | 220559 | ExclusiveLock | t | f virtualxid | [null] | [null] | [null] | [null] | 5/4791 | [null] | [null] | [null] | [null] | 5/4791 | 220559 | ExclusiveLock | t | t relation | 4267681 | 5290151 | [null] | [null] | [null] | [null] | [null] | [null] | [null] | 7/562 | 253641 | RowExclusiveLock | t | t transactionid | [null] | [null] | [null] | [null] | [null] | 170706187 | [null] | [null] | [null] | 7/562 | 253641 | ExclusiveLock | t | f virtualxid | [null] | [null] | [null] | [null] | 7/562 | [null] | [null] | [null] | [null] | 7/562 | 253641 | ExclusiveLock | t | t PG\u0026rsquo;s row lock implementation is quite complex — it involves not only tuple locks, but also transactionid and relation locks. 
Among these, only locktype=relation and virtualxid can use the fast path; all others cannot.\nCompare with the no-row-lock case:\n-- Session 1 begin; update lzl1 set b=\u0026#39;zzz\u0026#39; where a=1; -- Session 2 begin; update lzl1 set b=\u0026#39;zzz\u0026#39; where a=2; -- different row, completes without waiting -- Session 3 select * from pg_locks where pid\u0026lt;\u0026gt;(select pg_backend_pid()) order by pid,locktype; locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted | fastpath ---------------+----------+----------+--------+--------+------------+---------------+---------+--------+----------+--------------------+--------+------------------+---------+---------- relation | 4267681 | 5290151 | [null] | [null] | [null] | [null] | [null] | [null] | [null] | 5/4792 | 220559 | RowExclusiveLock | t | t relation | 4267681 | 5290151 | [null] | [null] | [null] | [null] | [null] | [null] | [null] | 5/4792 | 220559 | AccessShareLock | t | t transactionid | [null] | [null] | [null] | [null] | [null] | 170706214 | [null] | [null] | [null] | 5/4792 | 220559 | ExclusiveLock | t | f virtualxid | [null] | [null] | [null] | [null] | 5/4792 | [null] | [null] | [null] | [null] | 5/4792 | 220559 | ExclusiveLock | t | t relation | 4267681 | 5290151 | [null] | [null] | [null] | [null] | [null] | [null] | [null] | 7/563 | 253641 | AccessShareLock | t | t relation | 4267681 | 5290151 | [null] | [null] | [null] | [null] | [null] | [null] | [null] | 7/563 | 253641 | RowExclusiveLock | t | t transactionid | [null] | [null] | [null] | [null] | [null] | 170706212 | [null] | [null] | [null] | 7/563 | 253641 | ExclusiveLock | t | f virtualxid | [null] | [null] | [null] | [null] | 7/563 | [null] | [null] | [null] | [null] | 7/563 | 253641 | ExclusiveLock | t | t The only difference is three fewer fastpath=f entries (2 here versus 5 in the row-lock case). 
The transactionid locks held by both sessions definitely can\u0026rsquo;t use the fast path.\nSummary of conditions for using the fast-path lock mechanism (all must be met):\nLock level \u0026lt;= 3 (up to RowExclusiveLock), i.e., SELECT/DML statements. locktype=relation; PG\u0026rsquo;s row locks also require at least transactionid and tuple locks, and these two can\u0026rsquo;t use the fast path. No more than 16 relations accessed (typically exceeded only with full partition access on partitioned tables). Conclusion # Is the row lock the cause or the effect? Is it a row lock problem, or did database performance degrade causing SQL to run slower and produce row locks? Row lock is the cause. The SQL execution count didn\u0026rsquo;t change, but the SQL parameters shifted from scattered to concentrated, i.e., updates to the same row noticeably increased. From the benchmark data, updating the same row produces row lock and LWLock LockManager waits.\nSQL execution count didn\u0026rsquo;t increase, so did SQL performance degrade? SQL performance did degrade, but not because a wrong index was chosen; it was simply because the same row was being updated repeatedly.\nSolution:\nFrom the business side, the SQL was tied to a certain API endpoint: each call increments the call count stored in the table. If the same endpoint is called repeatedly, the same row is updated repeatedly. Therefore, reducing repeated calls to the same endpoint, or batching the database updates into fewer, larger batches, is expected to mitigate this problem.\n","date":"Dec 21, 2025","externalUrl":null,"permalink":"/en/2025/12/21/case-study-row-locks-and-lwlock-lockmanager/","section":"Posts","summary":"Symptoms # The database showed a large number of row locks and a smaller number of LWLock LockManager waits. CPU was maxed out and active sessions spiked. The blocking PID associated with the locks kept changing, with no obvious long-transaction blocker. 
(Imagine high CPU and active sessions.)\nThe SQL corresponding to the large number of locks was as follows:\nUPDATE lzl_record SET rc_lzl1= rc_lzl1 + $1, pc_lzl2 = pc_lzl2 + $2, rc_lzl3 = rc_lzl3 + $3 where lzl_id = $4 Analysis # No Increase in SQL Concurrency Observed # From the correlation between hits and CPU, we can analyze from the SQL hit perspective. That UPDATE SQL accounted for about 80% of activity. The SQL’s execution count had not changed, but blks hit was clearly abnormal.\n","title":"Case Study: Row Locks and LWLock LockManager","type":"posts"},{"content":" As a DBA # As a DBA, I strongly believe in first principles and information theory when it comes to problem analysis. A DBA needs to deeply understand the system, understand PostgreSQL, to explain anomalies from first principles. For example, in the first half of the year I spent considerable effort understanding Linux memory, exploring the essence of memory issues and their solutions. At the same time, this year I took a step forward in system operations — no longer focusing solely on technical problems and handling, but more on providing solutions. These should encompass thinking across the PostgreSQL database technology dimension, the system dimension, and the management dimension.\nHere\u0026rsquo;s a simple classification of cloud DBA work:\nMany Ops papers only talk about incident handling, but in reality, incident handling probably accounts for less than 5% of actual operational workload. And whether in academia or practice, anomaly ops itself isn\u0026rsquo;t very effective anyway. So I\u0026rsquo;m not very bullish on AIOps being able to significantly help DBAs. 
Note that DBAs using AIOps and DBAs using AI are two different things.\nActually, this diagram is just so-so, because it doesn\u0026rsquo;t include leadership tasks, which are definitely the bulk.\nLooking back at the 2023 and 2024 year-end summaries, I can simply summarize my DBA work year by year:\n2023: Comprehensive PostgreSQL learning 2024: Comprehensive PostgreSQL operations 2025: Responsible for 1510 emotional value What\u0026rsquo;s deeply ironic is that last year\u0026rsquo;s conclusion — \u0026ldquo;DBAs are providing 1510 emotional value to their leaders\u0026rdquo; — became my lived reality this year. I don\u0026rsquo;t want to say more about it. In short, it\u0026rsquo;s been exhausting, mentally draining. I hope next year brings improvement.\nREADING # This year I read even more books than last year (from 20+ to 30+), but wrote even fewer reading notes. Writing is indeed troublesome and energy-consuming, and I\u0026rsquo;ve grown to prefer the feeling of reading itself. Compared to last year, this year\u0026rsquo;s reading shows a clear decrease in PostgreSQL technical books, an increase in comprehensive technical books, and I even started reading psychology, economics, and philosophy. In short, broader hunting grounds, not limited to databases alone. Also fewer novels — novels are like snacks, and I\u0026rsquo;m increasingly losing interest in such non-nutritious content.\nThis year\u0026rsquo;s book list generally falls into: IT Systems, Economics, Popular Science, Spiritual, and Fiction categories. As with last year, ranked by personal preference.\nIT Systems Book List:\n\u0026ldquo;SRE: Google\u0026rsquo;s Approach to Service Reliability\u0026rdquo; — DBAs are not SREs, but their work involves system stability objectives, which has similarities with DBA work. 
Some content in this book about cloud environments or management aspects was truly enlightening — for example, SLA, systems engineering, operational pressure, busy work, role rotation, \u0026ldquo;trust the team rather than a single technical expert,\u0026rdquo; and more. Absolutely brilliant. Recently I also heard the term DBRE — Database Reliability Engineer — which fits my current role even better than DBA. In short, an excellent book, a must-read for modern ops.\n\u0026ldquo;Running Linux Kernel: Introduction\u0026rdquo; — operating open-source databases requires understanding the operating system. One of my books for studying Linux memory.\n\u0026ldquo;Deep Understanding of Linux Processes and Memory\u0026rdquo; — one of my books for studying Linux memory.\n\u0026ldquo;Understanding the Linux Kernel\u0026rdquo; — one of my books for studying Linux memory.\n\u0026ldquo;Observability Engineering\u0026rdquo; — the patterns and flaws of traditional monitoring and traditional ops, and what observability essentially means. Quite helpful.\nEconomics Book List:\n\u0026ldquo;Microeconomics\u0026rdquo; — a masterpiece, by Daron Acemoglu. I consider it essential reading for life. This book has my best notes of any book. Not only understanding economics, but further understanding society. Some viewpoints left a deep impression on me:\nProves why the market is an invisible hand that maximizes social surplus value — any intervention reduces social surplus value. Under what circumstances markets are ineffective: externalities, public resources, and common-pool resources. Women earn less than men in the workplace partly because women bear children and cannot participate in production during that time. The function of academic credentials is signaling — to a certain degree, they certify the productive value of the person. Business entry and exit are normal market signals, not signs of disorder. The trade-off between equity and efficiency is a subject of study. 
\u0026ldquo;Why Nations Fail\u0026rdquo; — a masterpiece, by Daron Acemoglu. This book can be summarized in one sentence: Why do nations succeed? Because of creative destruction. Daron Acemoglu won the 2024 Nobel Prize in Economics for \u0026ldquo;research on how institutions are formed and how they affect prosperity.\u0026rdquo; What\u0026rsquo;s even more remarkable is that this book is easier to understand than other economics works. The top-recommended economics masterpiece.\n\u0026ldquo;The Rational Optimist\u0026rdquo; — said to rival \u0026ldquo;Sapiens,\u0026rdquo; but it\u0026rsquo;s definitely a notch below. However, the content quality isn\u0026rsquo;t bad, and it\u0026rsquo;s more economics-oriented. Some viewpoints are very fresh, for example:\nModern economics makes the rich richer, but the poor are not getting poorer. Self-sufficiency is poverty. What distinguishes humans from animals is barter exchange (in \u0026ldquo;Sapiens\u0026rdquo; it\u0026rsquo;s the cognitive revolution). Higher income leads to greater happiness — this is a fact. The elevation of trade in social status came from the rise of maritime trade, because land trade was unstable and easily plundered. \u0026ldquo;Reminiscences of a Stock Operator\u0026rdquo; — feels like I learned something and nothing at the same time. Decent read though.\n\u0026ldquo;Game Theory\u0026rdquo; — honestly, I found it average. Not much content, quite superficial. I mainly read it because economics books keep mentioning game theory, so I flipped through it to evaluate.\n\u0026ldquo;The Wealth of Nations\u0026rdquo; — extremely dense, not for normal people to read. Incredibly content-rich. Adam Smith must have been a genius — hard to imagine what kind of mind produced this. Too difficult for me, didn\u0026rsquo;t finish, gave up.\nPopular Science Book List:\n\u0026ldquo;A Brief History of Intelligence\u0026rdquo; — a masterpiece, essential reading for the AI era. 
This book is worn from my constant reading, covered in notes everywhere. Deconstructing the human brain, understanding what intelligence is, understanding how AI came to be. I give it full marks! Now whenever I see any animal, I first think about what intelligence level it\u0026rsquo;s at\u0026hellip;\n\u0026ldquo;On Top of Tides\u0026rdquo; — by Wu Jun. Every IT professional should read this book. It tells the rise and fall of major IT companies. You can learn about Oracle, Google, Fairchild, Bell Labs, and even basics about venture capital. Every company has its own DNA, which is nearly unchangeable and determines the company\u0026rsquo;s culture and characteristics. A programmer\u0026rsquo;s must-read.\n\u0026ldquo;The Almanack of Naval Ravikant\u0026rdquo; — has many useful perspectives, like views on marginal utility. And more importantly, it recommended one of my favorite books this year — \u0026ldquo;Microeconomics.\u0026rdquo; It also recommended meditation, which changed my habits.\n\u0026ldquo;How to Manage a Software Company\u0026rdquo; — by Frank Slootman, a legendary Silicon Valley CEO who led three software companies (ServiceNow, Data Domain, Snowflake) to successful IPOs. A very good book, looking at company development, employee management, execution, decision-making, and decision failures from an IT company manager\u0026rsquo;s perspective. Highly recommended.\n\u0026ldquo;The Economics of Aging\u0026rdquo; — by Kenichi Ohmae. Using Japan\u0026rsquo;s aging problem to glimpse China\u0026rsquo;s aging problems and opportunities. The demographic structural risks in our country are severe and about to come to a head. In this era, highly recommended reading.\n\u0026ldquo;The Fourth Wave\u0026rdquo; — by Kenichi Ohmae. Mainly about how Japan missed the IT technology wave, still relying on old industries to support the national economy, appearing somewhat envious of South Korea and China. 
I personally love the author\u0026rsquo;s attitude of directly criticizing the prime minister, haha.\n\u0026ldquo;The Checklist Manifesto\u0026rdquo; — explains the necessity of checklist inspections before Western surgical procedures. Seemingly simple steps can dramatically increase surgical success rates. This book had a big impact on my work — I genuinely brought the checklist concept into my work. I treat database operations like a surgical procedure — checklists are a simple yet necessary means to improve success rates.\n\u0026ldquo;The Mythical Man-Month\u0026rdquo; — \u0026ldquo;adding people\u0026rdquo; cannot linearly reduce systems engineering project timelines, but you also can\u0026rsquo;t simply reject \u0026ldquo;adding people\u0026rdquo; because large systems engineering projects genuinely require many people collaborating. It\u0026rsquo;s a good book, but calling it a programmer\u0026rsquo;s must-read feels like a stretch.\n\u0026ldquo;The Beauty of Mathematics\u0026rdquo; — by Wu Jun. Also quite good. Technology always has its mathematical foundations. This book accessibly tells the beauty of mathematics.\n\u0026ldquo;McKinsey Structured Thinking\u0026rdquo; — any problem should be structurally decomposed. When I encounter new problems, I think this way. A useful book.\n\u0026ldquo;The Chrysanthemum and the Sword\u0026rdquo; — stock from years ago that I dug out to read. An American\u0026rsquo;s post-WWII perspective on Japan. You can glimpse aspects of Japanese culture like modified Confucianism without \u0026ldquo;benevolence (ren),\u0026rdquo; the psychology of indebtedness, etc. One drawback is it\u0026rsquo;s quite dated — modern Japan is largely different from that era.\n\u0026ldquo;The Black Swan\u0026rdquo; — a black swan refers to unforeseen extreme events. Black swan events will always happen — there\u0026rsquo;s no such thing as 100% accurate prediction. 
It also discusses classification, which reminded me of content from \u0026ldquo;Structured Thinking\u0026rdquo; and \u0026ldquo;The Worlds I See\u0026rdquo;: \u0026ldquo;The essence of human understanding is classifying things,\u0026rdquo; but classification always awkwardly leaves some things unclassifiable or unable to be classified. Black swan events exist from the moment of classification. An interesting and noteworthy reflection.\n\u0026ldquo;The Professional\u0026rdquo; — by Kenichi Ohmae. Very mediocre, not recommended.\nSpiritual / Self-Help Book List:\n\u0026ldquo;The Evolution of Desire\u0026rdquo; — evolutionary psychology, a masterpiece. \u0026ldquo;Die with Zero\u0026rdquo; — experience the right things at different life stages. Even if you revisit something after missing it, it won\u0026rsquo;t feel the same as experiencing it at the right time. A life manual, highly recommended. \u0026ldquo;Ten Minutes Meditation\u0026rdquo; — mainly about the importance of meditation and how to do it. I learned meditation through this book. When I first completed meditation, I fell in love with it. It gave me a feeling of being taken to outer space and then returning to Earth. More importantly, it truly relieves stress. Meditation has become part of my life. \u0026ldquo;The Manipulation Bible\u0026rdquo; — okay. \u0026ldquo;Siddhartha\u0026rdquo; — incomprehensible, rubbish. \u0026ldquo;The Book of Life\u0026rdquo; — pure chicken soup, rubbish. Fiction Book List:\n\u0026ldquo;The Stranger\u0026rdquo; — a masterpiece. An indescribable sense of authenticity, feeling like an outsider oneself. \u0026ldquo;Yellowface\u0026rdquo; — a very interesting book about a white American woman who plagiarizes an unpublished work by a deceased Asian writer, even using a very Chinese pen name. When fans discover she\u0026rsquo;s white, you can feel the embarrassment. Playfully explores racial prejudice. As thrilling as watching a TV drama — twists and turns, gripping. 
Highly recommended. \u0026ldquo;The World of Yesterday\u0026rdquo; — by Stefan Zweig. Austria, Europe, WWI and WWII through a writer\u0026rsquo;s eyes. Returning to that turbulent Europe from a different angle. A very good book. \u0026ldquo;Project Hail Mary\u0026rdquo; — sci-fi. I increasingly dislike reading sci-fi. This one is okay: imagine you\u0026rsquo;re on an alien exploration mission, all your crewmates have died, and you happen to encounter a friendly alien. How do you communicate with them\u0026hellip; \u0026ldquo;Letter from an Unknown Woman\u0026rdquo; — by Stefan Zweig. Not good. Only the first story is somewhat novel. No interest in seriously reading the other two. \u0026ldquo;Satantango\u0026rdquo; — incomprehensible. Even Nobel Prize in Literature winners vary in quality. Blog and WeChat Official Account # The name of my WeChat Official Account has always been a struggle. I didn\u0026rsquo;t put much thought into maintaining it anyway, so I casually used a few names. This year I watched a documentary — \u0026ldquo;The Last Porter\u0026rdquo; (最后的棒棒), which moved me deeply. The DBA profession, like the porters of Chongqing, is undergoing tremendous change. So I simply changed it to \u0026ldquo;最后的DBA\u0026rdquo; (The Last DBA). This name rolls off the tongue nicely and carries some historical context and philosophical reflection. Seems like a good name.\nSince a lot of time goes into work, I didn\u0026rsquo;t have much time for writing to begin with. Plus, this year my operational approach kept changing, and no matter how I adjusted my daily schedule, I couldn\u0026rsquo;t carve out a good time slot. I even invested some money, and my time still didn\u0026rsquo;t increase, which frustrated me for quite a while. Looking back now, I only published 12 articles this year — not even one in the first half of the year.\nVery dissatisfied. 
\u0026#x1f620;\nI don\u0026rsquo;t know if my skills have improved or if the system is genuinely stable, but cases worth deep research seem to have become fewer. But this isn\u0026rsquo;t really a big problem. This year I also started treating paper interpretation as an article type. I personally feel the results are decent — I can learn quite a bit, without being too insular or reinventing the wheel. Using AI to interpret papers would certainly be fast, but I personally feel there are two problems:\nDo I truly understand? I feel like I don\u0026rsquo;t — it\u0026rsquo;s not the same concept as reading through it myself. Reading it yourself not only allows deeper understanding but also lets you discover all sorts of quirky details. Can\u0026rsquo;t pad articles. If I can interpret a paper with one prompt, then I feel the dissemination value is minimal — surely there\u0026rsquo;s no one not using AI now, right? Of course, I don\u0026rsquo;t read every paper word by word — that would be too inefficient. I only select papers that I feel are good and worth frame-by-frame interpretation, and savor them carefully.\nA quick summary of this year\u0026rsquo;s articles:\nToo few in quantity Slightly improved quality, and useful content (several articles I\u0026rsquo;m personally very satisfied with) Explored new formats Final Thoughts # This year was a busy one, with both bad and good memories. Many important things were left unfinished. Next year should bring significant changes. 
Writing this year-end summary is quite interesting — looking back to see what my past selves were up to is a fun experience.\nLast year\u0026rsquo;s 2025 OKRs:\nContinue some things — FAILED Think about how to produce output — FAILED Master another track — HALF SUCCESSFUL PostgreSQL\u0026hellip; haven\u0026rsquo;t figured out what more to do — FAILED Find a way to resume fitness — FAILED 2026 Plan:\nContinue some things Pay attention to my psychological and physical health — next year\u0026rsquo;s annual health inspection alerts should be lower than this year\u0026rsquo;s Pay attention to article readership, maintain the WeChat Official Account Explore DB AI Ops, report to myself next year Manage upward — don\u0026rsquo;t invest too much time in work Travel during holidays instead of grinding Read no fewer than 30 books, but don\u0026rsquo;t focus solely on quantity ","date":"Dec 21, 2025","externalUrl":null,"permalink":"/en/2025/12/21/my-2025-year-end-summary/","section":"Posts","summary":"As a DBA # As a DBA, I strongly believe in first principles and information theory when it comes to problem analysis. A DBA needs to deeply understand the system, understand PostgreSQL, to explain anomalies from first principles. For example, in the first half of the year I spent considerable effort understanding Linux memory, exploring the essence of memory issues and their solutions. At the same time, this year I took a step forward in system operations — no longer focusing solely on technical problems and handling, but more on providing solutions. These should encompass thinking across the PostgreSQL database technology dimension, the system dimension, and the management dimension.\n","title":"My 2025 Year-End Summary","type":"posts"},{"content":"Paper: DBAIOps: A Reasoning LLM-Enhanced Database Operation and Maintenance System using Knowledge Graphs\nRepo: https://github.com/weAIDB/DBAIOps/\nWhat is DBAIOps # Why DBAIOps: Manual operations are extremely time-consuming. 
Manual operations are difficult to scale. Manual operations are often trapped in recurring failures. Documentation + RAG models are inaccurate (limited DBA experience integration). In short, both manual operations and existing solutions are mediocre, hence DBAIOps — an operations system combining LLM reasoning and knowledge graphs to achieve DBA-like diagnostic capabilities.\nComparison of database failure analysis approaches: Rule-based approach: Traditional, rigid. Machine learning approach: Essentially rule-based with similar limitations; depends on training data leading to lower generation capability; generally suitable for diagnosing common specific problems. LLM-based approach: Uses general documentation and LLMs (e.g., decision-tree-based), prone to giving generic results. LLM+RAG approach: Searches based on chunked top-k approximate knowledge; results are inaccurate. After comparing the above approaches, the advantages of DBAIOps combining graph knowledge, DBA experience, and LLMs are clear: Incorporates DBA experience. Preserves original relationships. Supports new root cause identification and solutions. Extensible. Overview # Left side is architecture, right side is an example.\nOffline: DBA experience is embedded into Neo4j, with the resulting graph model called ExperienceGraph, where edges represent anomaly phenomena or metric relationships. The embedded anomaly model is called AnomalyModel.\nOnline: Anomaly analysis, retrieval, and report generation. 
The AnomalyProcessor extracts standard failure information and AnomalyModel information, then retrieves the graph via ExperienceRetriever; finally, RootCauseAnalyzer calls the LLM to generate analysis reports.\nFrom the right-side example, we can see graph relevance finding LOG FILE SYNC associated with LOG WRITE performance and IO performance; through REDO ALLOCATION, we can find table structure changes and DDL.\nThe Operations Experience Graph Model # Unlike rule-based or document-chunk-based RAG, ExperienceGraph is a graph model encoding heterogeneous operations experience information. The graph contains three elements: (vertices, directed edges, relationships on edges).\nBased on the characteristics of operations experience, DBAIOps classifies vertices:\ntrigger vertex: Used to detect database anomalies; the entry point for anomaly analysis. For example, LOG FILE SYNC is an entry vertex. metric vertex: Database runtime metrics. For offline knowledge, this refers to metrics from operations case studies (if present). experience vertex: Encodes domain-specific operations experience, covering anomaly meanings and handling methods. For example, LOG FILE SYNC exceeding 60ms indicates overly frequent commits or parameter adjustments needed. tool vertex: Executable scripts for collecting and analyzing anomaly metrics. tag vertex: Semantic categories of graph vertices. For example, \u0026ldquo;Concurrent Transactions\u0026rdquo; involves multiple vertex types; tag vertices strengthen cross-case associations. auxiliary vertex: Explains the meaning of metrics. Edge classification:\ncontainment edge: Trigger Vertex - Experience Vertex relevance edge: Trigger Vertex - Metric Vertex diagnosis edge: Experience Vertex - Metric Vertex synonym edge: Only appears between Tag Vertices, indicating semantic synonymy, e.g., physical_read and disk_read; shared_pool and shared_buffer. 
Analyzing the operations experience graph model through an example:\nLOG FILE SYNC has multiple TAGs, and TAGs are associated with Experience, metrics, and tools. The strong relevance is evident — it represents a human DBA\u0026rsquo;s understanding and operations experience of LOG FILE SYNC.\nGraph Construction # Manual graph construction is unreliable, and existing ML-generated graphs may generate irrelevant relationships, so a semi-automatic graph generation approach is proposed.\nGraph initialization: This part is manually generated, defining trigger vertices according to rules. Once trigger vertices are generated, their associated metric vertices, experience vertices, etc., are automatically generated. This is somewhat like a human DBA guiding the creation of a knowledge sketch — the overall framework cannot be changed; nothing bizarre should be generated. Graph storage: Stored in Neo4J. Additionally, different database types are marked with tags, making much knowledge reusable and avoiding duplicate graph construction. Graph augmentation: Generating more edges. Graph updates: DBAIOps supports incremental updates. Updates here include both adding new vertices and removing old vertices. Anomaly Model # Metrics # Metrics come from many sources, including runtime information (CPU %, throughput, etc., routine monitoring), logs, traces, etc. Combined with relevance differences, strongly correlated metrics need to be extracted. So metrics are divided into 2 categories:\nImmediately collected metrics: Runtime information, logs, traces. Subsequently collected metrics: Periodic, delta, etc., metrics generated when needed, such as AWR/ASH data. 
Regarding metric-anomaly correlation, unlike baseline-based approaches, DBAIOps uses specific metric combinations for each anomaly type.\nFinally, a formula determines whether an anomaly has actually occurred:\nTwo-Stage Graph Evolution # Database anomalies rarely occur in isolation — one performance issue may simultaneously trigger or exacerbate others. However, connections between different anomaly models (e.g., LOG_FILE_SYNC and REDO_ALLOCATION) in pre-built knowledge graphs tend to be loose, with shared experience fragments sparse and fragmented. This makes it difficult for traditional methods to discover cross-model composite root causes, such as combined I/O bottleneck and memory pressure issues.\nTo address this challenge, DBAIOps proposes an automatic \u0026ldquo;graph evolution\u0026rdquo; mechanism that dynamically discovers and connects relevant experience fragments between different anomaly models, evolving the knowledge graph from an initially sparse structure into a densely interconnected network, thus supporting more comprehensive root cause analysis.\nStage 1 - Graph Inference and Proximity Discovery: Uses graph query language (Cypher) to collect and aggregate relevant metrics, traversing related nodes and edges based on configurable thresholds to build association networks. For example, starting from LOG_FILE_SYNC latency, traverse up to 3 hops of associated nodes. Establish connections between LOG_FILE_SYNC and REDO_ALLOCATION models because they are both related to I/O-related concurrency issues. Through multiple iterations, the knowledge graph gradually evolves into a denser structure, enabling diagnosis to consider more potential factors and composite causes.\nStage 2 - Adaptive Abnormal Metric Detection: Identifies truly anomalous metrics along graph expansion paths. Using an Adaptive Detection Function (ADF), it calculates composite anomaly scores considering dimensions such as metric volatility and dynamic baseline deviation. 
Based on anomaly scoring results, it decides whether further knowledge graph structure expansion is needed, filtering a precise subset of anomaly metrics for subsequent LLM root cause reasoning.\nGenerating Analysis Reports # Once the graph is ready, prompts need to be fed to the LLM to generate desired reports. A well-structured prompt can also improve report accuracy.\nAnomalies have 5 components, which serve as the prompt for the LLM:\nAnomaly: Anomaly description (\u0026ldquo;CPU usage spiked to 95% at 16:00 on 2023-10-05\u0026rdquo;) Condition: Anomaly trigger condition (\u0026ldquo;exceeds 90% for \u0026gt;5 min\u0026rdquo;) Metrics Experience: Provides normal load values or recent maintenance tasks. Output: Describes the report\u0026rsquo;s composition — anomaly verification (requiring further analysis), root cause analysis, recovery plan, summary, SQL text. Some personal thoughts:\nRecent maintenance tasks are very useful — maintenance tasks generally have strong correlation, and failure analysis can\u0026rsquo;t just be simple technical analysis. However, who updates these maintenance tasks and which ones to update or not update is a problem.\nThe first few items in output are easy to understand, but the last one — SQL text — is a stroke of genius. In production environments, aside from hardware failures, database runtime status is strongly correlated with SQL. I personally believe you can unthinkingly capture SQL and discuss causality later. From an operations perspective, failures always require joint investigation with developers, so SQL text is basically mandatory to capture.\nEvaluation # Comparison of analysis report quality across different tools and approaches:\nImpressive results. Notably, DBAIOps specifically emphasizes that mid-sized LLMs already produce good analysis results. 
This is important — DeepSeek-R1 671B running bare isn\u0026rsquo;t bad, but the cost is on a completely different level.\nNitpicking # Can\u0026rsquo;t really be called \u0026ldquo;Ops\u0026rdquo; — it only has failure analysis functionality. Ops content is vast; failure analysis is just the tip of the iceberg.\nGraph classification doesn\u0026rsquo;t match the graph example. The defined tag vertices and edges differ significantly from the example.\nThe vertices in the example play important roles, but these edge types aren\u0026rsquo;t defined: tag vertex-tool vertex, tag vertex-experience vertex, tag vertex-metric vertex. And the edges that should exist seem mostly absent, with only synonym edges present.\nUndescribed parts of the example should be listed, otherwise it\u0026rsquo;s confusing.\nThe two-stage graph evolution results are a bit odd: w/o ADF means without Stage 2 graph evolution (adaptive abnormal metric detection). w/o ADF should mean without Stage 1 graph evolution (graph inference and proximity discovery). w/o ADF means without either stage of graph evolution.\nHere, the case with both stages of graph evolution is missing — having it would better demonstrate the effectiveness of two-stage graph evolution.\nRoot causes are somewhat limited: The circled ones should be relatively common (I only looked at Oracle and Postgres), but these root causes are currently absent.\nPG\u0026rsquo;s root causes are a bit sparse. Dirty page flushing generally isn\u0026rsquo;t a major issue — as a root cause, it probably ranks behind many other root causes.\nSummary # Points I personally really like:\nGraphRAG should be better than vector RAG for failure diagnosis. (GraphRAG original paper: From Local to Global: A GraphRAG Approach to Query-Focused Summarization)\nSS represents vector RAG, TS represents source text summaries, and C0/C1/C2/C3 represent GraphRAG at different knowledge granularities. 
From this chart, we can simply conclude: GraphRAG is better suited for multi-document complex scenarios and multi-angle analysis, but may not necessarily outperform vector RAG in precision.\nSemi-automatic graph generation approach. Graph generation is semi-automatic — trigger vertices are manually created, others can be auto-generated. For example, LOG FILE SYNC is a trigger vertex. Failure entry points can indeed be made into obvious anomaly points — these are the entry points. Same for PG, same for any failure — it aligns with human logic for understanding failures.\nAutomatic graph evolution. Strengthening associations between certain vertices is meaningful, as evident from the \u0026ldquo;Performance of DBAIOps Variants\u0026rdquo; table.\nAutomatic baseline adjustment. In Observability Engineering, there\u0026rsquo;s this passage about AIOps:\nAI can only help when there are clearly discernible patterns and it can identify shifting baselines for prediction — such AIOps doesn\u0026rsquo;t exist yet.\nDBAIOps in my eyes:\nClearly discernible patterns = DBAIOps\u0026rsquo;s graph, which includes failure models, anomaly relationships, monitoring data, and logs.\nShifting baselines = DBAIOps\u0026rsquo;s adaptive abnormal metric detection.\nIn summary, it\u0026rsquo;s a significant advancement over random chunking of failure knowledge, setting a single baseline, and vector approximate search in RAG models.\nOriginal link: https://lastdba.com/2025/12/21/论文精读dbaio-ps/\n","date":"Dec 21, 2025","externalUrl":null,"permalink":"/en/2025/12/21/paper-deep-read-dbaiops/","section":"Posts","summary":"Paper: DBAIOps: A Reasoning LLM-Enhanced Database Operation and Maintenance System using Knowledge Graphs\nRepo: https://github.com/weAIDB/DBAIOps/\nWhat is DBAIOps # Why DBAIOps: Manual operations are extremely time-consuming. Manual operations are difficult to scale. Manual operations are often trapped in recurring failures. 
Documentation + RAG models are inaccurate (limited DBA experience integration). In short, both manual operations and existing solutions are mediocre, hence DBAIOps — an operations system combining LLM reasoning and knowledge graphs to achieve DBA-like diagnostic capabilities.\n","title":"Paper Deep Read: DBAIOps","type":"posts"},{"content":" Problem Phenomenon # After physical migration to Xinchuang, occasional errors appear in the pg log, version pg15:\nWARNING: 01000: collation \u0026#34;zh_CN.utf8\u0026#34; has version mismatch DETAIL: The collation in the database was created using version 2.17, but the operating system provides version 2.28. HINT: Rebuild all objects affected by this collation and run ALTER COLLATION pg_catalog.\u0026#34;zh_CN.utf8\u0026#34; REFRESH VERSION, or build RaseSQL with the right library version. LOCATION: pg_newlocale_from_collation, pg_locale.c:1660 Context: During the physical switch, invalid index rebuilding and refresh database collation version were performed.\nAlthough the libc version was upgraded after physical migration, indexes were rebuilt and are now valid, and the collation version in the database is already consistent with the OS libc.\nSo,\nWhy is the error reported?\nWhere is the error triggered?\nWhat is the impact of the error?\nHow to resolve it?\nProblem Analysis # Why is the error reported? # The collation inside the database mainly involves 3 aspects: database, columns, and indexes. The first two use default collation, and the index collation is the real collation.\nFirst, check the database collation. 
All databases use en_US.UTF8, and refresh database collation has already been done, so the \u0026ldquo;collation \u0026quot;zh_CN.utf8\u0026quot; has version mismatch\u0026rdquo; error should not be thrown at the database layer.\nThen check for columns with an explicitly specified (non-default) collation:\nselect attrelid,attname,attcollation from pg_attribute where attcollation not in (0,100,950,951); attrelid | attname | attcollation ----------+---------+-------------- (0 rows) 0 means no collation, default oid=100, C oid=950, POSIX oid=951; \u0026ldquo;zh_CN.utf8\u0026rdquo; definitely won\u0026rsquo;t be any of these four.\nFinally, check for indexes with an explicitly specified collation:\nselect * from (select indexrelid ,unnest(indcollation) coll from pg_index) i where coll not in (0,100,950,951); indexrelid | coll ------------+------ (0 rows) Having ruled out database, columns, and indexes, only one situation remains: the application layer specifies a sort rule:\nselect col1 from (values (\u0026#39;a\u0026#39;), (\u0026#39;A\u0026#39;), (\u0026#39;啊\u0026#39;), (\u0026#39;阿\u0026#39;)) AS l(col1) order by col1 collate \u0026#34;zh_CN.utf8\u0026#34;; WARNING: 01000: collation \u0026#34;zh_CN.utf8\u0026#34; has version mismatch DETAIL: The collation in the database was created using version 2.17, but the operating system provides version 2.28. HINT: Rebuild all objects affected by this collation and run ALTER COLLATION pg_catalog.\u0026#34;zh_CN.utf8\u0026#34; REFRESH VERSION, or build RaseSQL with the right library version. 
LOCATION: pg_newlocale_from_collation, pg_locale.c:1660 col1 ------ 阿 啊 a A This zh_CN.utf8 version is inconsistent with the actual one:\nselect collname,collversion,pg_collation_actual_version(oid) from pg_collation where collname =\u0026#39;zh_CN.utf8\u0026#39;; collname | collversion | pg_collation_actual_version ------------+-------------+----------------------------- zh_CN.utf8 | 2.17 | 2.28 It is not only zh_CN.utf8: every collation differs (except a few collations without a version concept).\nSo it\u0026rsquo;s very likely that the application itself specified a sort rule \u0026ldquo;zh_CN.utf8\u0026rdquo;, but the coll version in the database is inconsistent with the OS, which triggered the warning.\nSource Code Understanding # The error message makes it easy to locate the source code position. Two main functions are of interest: pg_newlocale_from_collation and CheckMyDatabase.\npg_newlocale_from_collation Caching and Checking pg_collation # pg_newlocale_from_collation has existed since collation support was added in 9.1; the collation version check inside it was added in pg10.\n/* * Create a locale_t from a collation OID. Results are cached for the * lifetime of the backend. Thus, do not free the result with freelocale(). * * As a special optimization, the default/database collation returns 0. * Callers should then revert to the non-locale_t-enabled code path. * In fact, they shouldn\u0026#39;t call this function at all when they are dealing * with the default locale. That can save quite a bit in hotspots. * Also, callers should avoid calling this before going down a C/POSIX * fastpath, because such a fastpath should work even on platforms without * locale_t support in the C library. * * For simplicity, we always generate COLLATE + CTYPE even though we * might only need one of them. Since this is called only once per session, * it shouldn\u0026#39;t cost much. */ /* locale_t means non-ICU. This function caches a locale_t type collation OID for the backend * the default/database collation returns 0. 
\u0026#34;default\u0026#34; means using the database\u0026#39;s collation */ pg_locale_t pg_newlocale_from_collation(Oid collid) // Note: passes in collation oid, not fetching all pg_collation { ... /* Return 0 for \u0026#34;default\u0026#34; collation, just in case caller forgets */ if (collid == DEFAULT_COLLATION_OID) // Three special collations: return (pg_locale_t) 0; // default oid=100, C oid=950, POSIX oid=951 ... if (cache_entry-\u0026gt;locale == 0) { ... collversion = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collversion, \u0026amp;isnull); // Get version from pg_collation data dictionary if (!isnull) { ... actual_versionstr = get_collation_actual_version(collform-\u0026gt;collprovider, collcollate); // Get actual version via get_collation_actual_version ... collversionstr = TextDatumGetCString(collversion); if (strcmp(actual_versionstr, collversionstr) != 0) // Compare data dictionary version and actual version, throw error if different ereport(WARNING, (errmsg(\u0026#34;collation \\\u0026#34;%s\\\u0026#34; has version mismatch\u0026#34;, NameStr(collform-\u0026gt;collname)), errdetail(\u0026#34;The collation in the database was created using version %s, \u0026#34; \u0026#34;but the operating system provides version %s.\u0026#34;, collversionstr, actual_versionstr), errhint(\u0026#34;Rebuild all objects affected by this collation and run \u0026#34; \u0026#34;ALTER COLLATION %s REFRESH VERSION, \u0026#34; \u0026#34;or build PostgreSQL with the right library version.\u0026#34;, quote_qualified_identifier(get_namespace_name(collform-\u0026gt;collnamespace), NameStr(collform-\u0026gt;collname))))); } ... return cache_entry-\u0026gt;locale; } The main check is: through the coll oid, check whether the version in the pg_collation data dictionary is consistent with the actual version; if inconsistent, throw an error.\nCheckMyDatabase Caching and Checking pg_database # CheckMyDatabase has existed for a long time, performing many database-side checks. 
However, pg15 added logic for checking the database version.\n/* * CheckMyDatabase -- fetch information from the pg_database entry for our DB */ static void CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connections) { ... /* Fetch our pg_database row normally, via syscache */ tup = SearchSysCache1(DATABASEOID, ObjectIdGetDatum(MyDatabaseId)); ... default_locale.provider = dbform-\u0026gt;datlocprovider; // default is the db\u0026#39;s /* * Default locale is currently always deterministic. Nondeterministic * locales currently don\u0026#39;t support pattern matching, which would break a * lot of things if applied globally. */ default_locale.deterministic = true; // byte-order sensitive /* * Check collation version. See similar code in * pg_newlocale_from_collation(). Note that here we warn instead of error * in any case, so that we don\u0026#39;t prevent connecting. */ datum = SysCacheGetAttr(DATABASEOID, tup, Anum_pg_database_datcollversion, \u0026amp;isnull); // Get datcollversion from pg_database if (!isnull) { char\t*actual_versionstr; char\t*collversionstr; collversionstr = TextDatumGetCString(datum); actual_versionstr = get_collation_actual_version(dbform-\u0026gt;datlocprovider, dbform-\u0026gt;datlocprovider == COLLPROVIDER_ICU ? iculocale : collate); // Get actual version via get_collation_actual_version ... 
else if (strcmp(actual_versionstr, collversionstr) != 0) // Compare db datcollversion and actual version, throw warning if not equal ereport(WARNING, (errmsg(\u0026#34;database \\\u0026#34;%s\\\u0026#34; has a collation version mismatch\u0026#34;, name), errdetail(\u0026#34;The database was created using collation version %s, \u0026#34; \u0026#34;but the operating system provides version %s.\u0026#34;, collversionstr, actual_versionstr), errhint(\u0026#34;Rebuild all objects in this database that use the default collation and run \u0026#34; \u0026#34;ALTER DATABASE %s REFRESH COLLATION VERSION, \u0026#34; \u0026#34;or build PostgreSQL with the right library version.\u0026#34;, quote_identifier(name)))); } ... } The CheckMyDatabase function compares the datcollversion in the pg_database data dictionary with the actual version.\nFunction Differences # In pg14 and before, there was only 1 collation comparison logic: when a session first caches the corresponding collation, it calls pg_newlocale_from_collation to access the version of the corresponding collation in the pg_collation data dictionary and compare it with the real version. In PG15 and later, because the datcollversion field was added to the pg_database table, a new logic for checking db collation version was added: when a session first accesses the db in pg_database, it calls CheckMyDatabase to check the datcollversion of the corresponding database in pg_database and compare it with the real version. Why Are There Fewer Errors After Only Refreshing the Database? # After refreshing the database collation version, the warning about inconsistent pg_database coll version won\u0026rsquo;t be triggered, but it still cannot rule out the situation where pg_collation\u0026rsquo;s coll version is inconsistent. Why are there so many fewer errors after only refreshing the database? 
Could it be that pg_collation\u0026rsquo;s coll version simply won\u0026rsquo;t be loaded?\nselect c.coll,count(*) from (select unnest(indcollation) coll from pg_index ) c group by c.coll; coll | count ------+------- 950 | 37 --C 0 | 2841 --No collation 100 | 723 --default In real environments, default is the most used. Generally, no one specifies a collation; if not specified it\u0026rsquo;s default, and default is the database\u0026rsquo;s default collation.\nHere we need to revisit the pg_newlocale_from_collation function. The function starts like this:\npg_locale_t pg_newlocale_from_collation(Oid collid) { collation_cache_entry *cache_entry; /* Callers must pass a valid OID */ Assert(OidIsValid(collid)); /* Return 0 for \u0026#34;default\u0026#34; collation, just in case caller forgets */ if (collid == DEFAULT_COLLATION_OID) return (pg_locale_t) 0; ... When collid==DEFAULT_COLLATION_OID==100, it directly returns without executing the real version check below, so it won\u0026rsquo;t throw a warning. 
This logic is reasonable because the db coll version has already been verified when logging into the database; if there\u0026rsquo;s a problem, a warning must have already been thrown at the session layer.\nFurthermore, even if another common value such as collid=950 (C, which appears 37 times in the result above) is passed in, C has no version concept, so there is nothing to compare.\nTherefore, after refreshing the database, in the vast majority of scenarios, as long as the database\u0026rsquo;s internal sorting is used (not expression sorting or specified index sorting), no warning will be thrown.\nTesting # Here we only test whether there is a refresh warning, not testing index corruption or database crashes.\n# Check libc version getconf GNU_LIBC_VERSION Source host version glibc 2.17 Target host glibc 2.28 pg version pg15+ Test: Refresh db without refreshing pg_collation, only db coll version changes # select datname,datlocprovider,datcollate,datctype,datcollversion from pg_database datname | datlocprovider | datcollate | datctype | datcollversion ------------+----------------+-------------+-------------+---------------- lzldb | c | en_US.UTF-8 | en_US.UTF-8 | 2.17 select collname,collprovider,collversion,pg_collation_actual_version(oid) from pg_collation where collname ~ \u0026#39;en_US.utf8\u0026#39;; collname | collprovider | collversion | pg_collation_actual_version ------------+--------------+-------------+----------------------------- en_US.utf8 | c | 2.17 | 2.28 alter database lzldb refresh collation version; NOTICE: 00000: changing version from 2.17 to 2.28 LOCATION: AlterDatabaseRefreshColl, dbcommands.c:2399 ALTER DATABASE Check pg_collation and pg_database again:\ncollname | collprovider | collversion | pg_collation_actual_version ------------+--------------+-------------+----------------------------- en_US.utf8 | c | 2.17 | 2.28 datname | datlocprovider | datcollate | datctype | datcollversion ------------+----------------+-------------+-------------+---------------- lzldb | c | en_US.UTF-8 | en_US.UTF-8 | 2.28 
Consistent with the official documentation description: refresh database collation version only refreshes the db\u0026rsquo;s default collation; pg_collation itself won\u0026rsquo;t change.\nTest: Refresh db without refreshing pg_collation, specifying expression sort reports warning # As analyzed at the beginning, expression sorting will report a warning, omitted.\nTest: Refresh db without refreshing pg_collation, creating a new index with specified collation reports warning # Test 1: Specify collation when creating index\ncollname | collversion | pg_collation_actual_version ------------+-------------+----------------------------- zh_CN.utf8 | 2.17 | 2.28 \u0026gt; create index idx11 on tt(a collate \u0026#34;zh_CN.utf8\u0026#34;); WARNING: 01000: collation \u0026#34;zh_CN.utf8\u0026#34; has version mismatch DETAIL: The collation in the database was created using version 2.17, but the operating system provides version 2.28. HINT: Rebuild all objects affected by this collation and run ALTER COLLATION pg_catalog.\u0026#34;zh_CN.utf8\u0026#34; REFRESH VERSION, or build PostgreSQL with the right library version. LOCATION: pg_newlocale_from_collation, pg_locale.c:1664 CREATE INDEX Test 2: Specify column default collation when creating table, don\u0026rsquo;t specify when creating index\n\\c lzldb -- Reconnect a session You are now connected to database \u0026#34;lzldb\u0026#34; as user \u0026#34;postgres\u0026#34;. create table ttt(a varchar(10) collate \u0026#34;zh_CN.utf8\u0026#34;); CREATE TABLE \u0026gt; create index idxttt on ttt(a); WARNING: 01000: collation \u0026#34;zh_CN.utf8\u0026#34; has version mismatch DETAIL: The collation in the database was created using version 2.17, but the operating system provides version 2.28. HINT: Rebuild all objects affected by this collation and run ALTER COLLATION pg_catalog.\u0026#34;zh_CN.utf8\u0026#34; REFRESH VERSION, or build PostgreSQL with the right library version. 
LOCATION: pg_newlocale_from_collation, pg_locale.c:1664 CREATE INDEX Time: 7.904 ms Column default collation and index specification of collation are essentially the same thing, both for specifying the index\u0026rsquo;s collation. Both can report warnings.\nTest: Refresh db without refreshing pg_collation, existing index with specified collation does not report warning # Scenario: The original database already has an index specifying collation zh_CN.utf8, different from the db. Refreshing the db won\u0026rsquo;t catch it. But after migrating to a new database, the vendor\u0026rsquo;s coll version definitely changed.\nselect collname,collprovider,collversion,pg_collation_actual_version(oid) from pg_collation where collname ~ \u0026#39;zh_CN.utf8\u0026#39;; collname | collprovider | collversion | pg_collation_actual_version ------------+--------------+-------------+----------------------------- zh_CN.utf8 | c | 2.17 | 2.28 Without using expression sorting, the index can be used, but index sorting cannot be used:\n\u0026gt; set enable_seqscan =off; SET \u0026gt; EXPLAIN ANALYZE SELECT a FROM tt ORDER BY a LIMIT 1000; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=6667.80..6670.30 rows=1000 width=33) (actual time=44.928..45.145 rows=1000 loops=1) -\u0026gt; Sort (cost=6667.80..6892.81 rows=90004 width=33) (actual time=44.926..45.021 rows=1000 loops=1) Sort Key: a Sort Method: top-N heapsort Memory: 127kB -\u0026gt; Index Only Scan using idxtt on tt (cost=0.42..1732.98 rows=90004 width=33) (actual time=0.029..15.434 rows=90004 loops=1) Heap Fetches: 4 Existing indexes with specified collation do not report warnings when used.\nSummary of This Problem # The refresh database and refresh collation warnings are session-level. 
In each session, for each database or each collation, it only reports once.\nOnly refreshing the database very likely won\u0026rsquo;t report warnings again, but there are situations where creating an index with a specified collation or running SQL with specified expression collation may still report warnings.\nThe coll version in the data dictionary is only for tracking whether the collation provider version has changed at the database layer. Imagine if there were no coll version in the data dictionary - the database might not even be able to return a warning saying \u0026ldquo;your sort rule provider has upgraded its version, your data sorting might have problems, you need to check it\u0026rdquo; (and of course it\u0026rsquo;s not just about sorting).\nSolutions for This Problem # Corrupt indexes have already been rebuilt, the database has been refreshed, only collation hasn\u0026rsquo;t been refreshed. The inconsistency of coll version in the data dictionary is not a big problem, it\u0026rsquo;s just a warning. As for other hidden and strange pitfalls, refer to the more section.\nSolution for this problem:\nStep 1: Check if there are still dependencies\nSELECT pg_describe_object(refclassid, refobjid, refobjsubid) AS \u0026#34;Collation\u0026#34;, pg_describe_object(classid, objid, objsubid) AS \u0026#34;Object\u0026#34; FROM pg_depend d JOIN pg_collation c ON refclassid = \u0026#39;pg_collation\u0026#39;::regclass AND refobjid = c.oid WHERE c.collversion \u0026lt;\u0026gt; pg_collation_actual_version(c.oid) ORDER BY 1, 2; If there are returns, it\u0026rsquo;s best to rebuild the dependent objects; if not, follow step 2:\nSolution 1: Do nothing. If there aren\u0026rsquo;t many warnings, leaving them alone is fine. Solution 2: Only refresh collation zh_CN.UTF8. Fix one as it comes. Solution 3: Refresh all collations. Even if the application incrementally uses expressions or index-specified collation, no warnings will be reported. 
More # Key Summary of glibc Upgrade Related Issues # Locale is a very tricky area, and glibc upgrades cause many collation-related problems. Drawing on the reference materials, here\u0026rsquo;s a summary of some important points:\npg_collation is obtained from the OS command locale -a; the provider is basically glibc, so you need to look at the glibc version.\nIn pg_collation, \u0026ldquo;C\u0026rdquo; and \u0026ldquo;posix\u0026rdquo; have collprovider c, which looks the same as \u0026ldquo;C.UTF8\u0026rdquo; etc., but they\u0026rsquo;re not. \u0026ldquo;C.UTF8\u0026rdquo;\u0026rsquo;s provider is glibc, has a version, generally Unicode codepoint sorting or Unicode semantic sorting; \u0026ldquo;C\u0026rdquo; and \u0026ldquo;POSIX\u0026rdquo; are equivalent, the most basic locale defined by the POSIX standard, implemented by libc, not in locale -a, has no version, sorts directly by byte order.\nRoot cause of collation problems: The database requires that locale definitions never change during the database lifecycle, but OS vendors, especially the GNU C library, make changes to locale in every minor version, and this is legitimate.\nThe version most prone to problems in practice is glibc 2.28, because 2.28 brought a major locale data update (the collation data was updated to a new upstream version from ISO that is in sync with Unicode 9.0.0).\npg has no way to detect compatibility issues caused by glibc upgrades. Index corruption checking is not a complete check, and indexes are only one aspect. After physical replication or upgrade, even if indexes are rebuilt, you cannot rule out the possibility that the database crashes one day due to collation version issues.\nData anomalies include: duplicate primary keys, violated sort-dependent constraints, range partition table data written to wrong partitions, wrong results from mergejoin and other sort-based operations, etc.\nCharacter types depend on collation. 
Data types that don\u0026rsquo;t depend on collation:\nbytea tsvector gin indexes pg_trgm indexes numeric data types: int, bigint, numeric, float, \u0026hellip; custom data types like geometry (PostGIS) timestamp ASCII sorting is relatively common but doesn\u0026rsquo;t conform to human understanding, i.e., not semantic. Semantically conforming international sorting standards are generally Unicode standards.\nUnicode-based sorting rules are divided into 2 types: codepoint sorting, UCA (Unicode Collation Algorithm).\nUCA is based on DUCET (Default Unicode Collation Element Table). The DUCET table itself may have sorting changes between different versions. For example, en_US.UTF8 is UCA sorting, equivalent to semantic sorting; version upgrades will change sorting rules. C.UTF8 is codepoint sorting; once codepoints are confirmed they won\u0026rsquo;t change, and sorting rules won\u0026rsquo;t change.\nPG 17+ provides a very safe locale provider method: builtin, no longer depending on OS-provided glibc, ICU and other providers. Example enable command:\ninitdb --locale-provider=builtin --builtin-locale=C.UTF-8 dbname1 PG 17 only supports C and C.UTF-8. C is byte-order sorting (approximately ASCII sorting), C.UTF-8 is Unicode codepoint sorting; PG 18 adds one more, PG_UNICODE_FAST, also Unicode codepoint sorting, with slight differences from C.UTF-8.\nBecause the database must maintain stable sorting, custom application sorting can only be pushed to the application layer. For example, expression sorting is semantically clear and doesn\u0026rsquo;t affect the database\u0026rsquo;s own choice of collation. If one day pg also supports built-in en_US.utf8, then we can consider built-in semantic sorting.\nDuring Xinchuang migration, the glibc version of Xinchuang hosts is generally higher than old Intel server glibc versions, likely crossing the 2.28 version. Combined with tight deadlines, KPI pressure, insufficient manpower, and large databases, physical migration is unavoidable. 
So Xinchuang physical migration needs to pay attention to glibc versions and many anomalies caused by collation.\nWhat to Do After Physical Migration # Assuming the database is en_US.utf8, provider c, and physical migration across libc versions has already been done, the following operations should be performed:\nI. Official Required Solution\nAt minimum, rebuild problematic indexes. Install the amcheck extension and use the bt_index_check function: SELECT bt_index_check(\u0026#39;idx1\u0026#39;::regclass, true); Refresh database version (pg15+): ALTER DATABASE name REFRESH COLLATION VERSION Check if there are other dependent objects. If there are, handle them accordingly: SELECT pg_describe_object(refclassid, refobjid, refobjsubid) AS \u0026#34;Collation\u0026#34;, pg_describe_object(classid, objid, objsubid) AS \u0026#34;Object\u0026#34; FROM pg_depend d JOIN pg_collation c ON refclassid = \u0026#39;pg_collation\u0026#39;::regclass AND refobjid = c.oid WHERE c.collversion \u0026lt;\u0026gt; pg_collation_actual_version(c.oid) ORDER BY 1, 2; After handling, then:\nRefresh collation version (pg10+): ALTER COLLATION name REFRESH VERSION II. Unofficial Workaround Solutions\nI haven\u0026rsquo;t made a complete solution here, just some thoughts.\nHandling partition table data written to wrong partition: Partition key is int/bigint/float, no relation to collation, can be ignored.\nPartition key is time partition, if timestamp, can be ignored. If varchar or other character types, depends on the situation.\nPartition key is character type, refer to \u0026ldquo;a\u0026rdquo; and \u0026ldquo;-\u0026rdquo; sorting (pgconf Collation Challenges Sorting It Out). But note the following points:\nIf querying data, don\u0026rsquo;t query from the parent table; it might crash or fail to return results. There\u0026rsquo;s no simple detection solution. 
Handling primary key/unique key conflicts.\nHandling fdw sort range anomaly issues.\nUnknown problems.\nref # https://wiki.postgresql.org/wiki/Locale_data_changes\nhttps://wiki.postgresql.org/wiki/Collations\npgconf Collation Challenges Sorting It Out\nPFCONF Collations from A to Z\nhttp://www.unicode.org/reports/tr10/tr10-34.html\nhttps://sourceware.org/glibc/wiki/Release/2.28\nhttps://www.postgresql.org/docs/18/sql-altercollation.html\nhttps://www.postgresql.org/docs/18/sql-alterdatabase.html\nhttps://www.postgresql.org/docs/17/locale.html#LOCALE-PROVIDERS\n","date":"Dec 13, 2025","externalUrl":null,"permalink":"/en/2025/12/13/from-collation-mismatch-exception-to-its-principles/","section":"Posts","summary":"Problem Phenomenon # After physical migration to Xinchuang, occasional errors appear in the pg log, version pg15:\nWARNING: 01000: collation \"zh_CN.utf8\" has version mismatch DETAIL: The collation in the database was created using version 2.17, but the operating system provides version 2.28. HINT: Rebuild all objects affected by this collation and run ALTER COLLATION pg_catalog.\"zh_CN.utf8\" REFRESH VERSION, or build RaseSQL with the right library version. LOCATION: pg_newlocale_from_collation, pg_locale.c:1660 Context: During the physical switch, invalid index rebuilding and refresh database collation version were performed.\n","title":"From collation mismatch Exception to Its Principles","type":"posts"},{"content":" PostgreSQL Logical Replication # ​​​​ （https://www.pgconf.asia/JA/2017/wp-content/uploads/sites/2/2017/12/D2-A7-EN.pdf）\nPostgreSQL places all logical decoding related matters entirely within the database\u0026rsquo;s replication slots for management — an all-inclusive approach. 
Early versions had somewhat limited logical replication support, but in recent major versions, logical replication has been one of the primary functional improvements.\nAdvantages of the PG approach:\nVery flexible: it exposes the logical decoding interface to users, with multiple types of decoding methods available. Users can subscribe to only the data they need based on their requirements. Disadvantages of the PG approach:\nThe number of concepts to learn and the learning cost are relatively higher compared to MySQL. Just the basic concepts — publication, subscription, walsender, replication slots, output plugins, etc. — I believe many people haven\u0026rsquo;t fully grasped their definitions and relationships. Does the hardest work and takes the hardest hits. All logical decoding problems are exposed within the database: WAL backlog, large transactions, long transactions, reorder transaction sorting, privilege issues, streaming transmission — these are all problems PG has to deal with. MySQL\u0026rsquo;s binlog # (https://blog.fasterinfo.top/6243.html)\nMySQL places all decoded logical data locally — in binlog files. The approach is simple. MySQL\u0026rsquo;s binlog is roughly equivalent to PostgreSQL with full-table logical replication enabled and written locally.\nAdvantages of the MySQL approach:\nSimple and straightforward: MySQL doesn\u0026rsquo;t expose the logical decoding interface directly to users. Instead, it provides already-decoded files directly to users, who don\u0026rsquo;t need to care about how parsing works — just read the binlog files. Mature ecosystem. I personally believe MySQL\u0026rsquo;s mature ecosystem is closely tied to binlog. During the internet era, PG\u0026rsquo;s logical replication was still weak, while binlog was extremely simple. Downstream parsing of binlog to put data onto other platforms became a common pattern. Disadvantages of the MySQL approach:\nAll data must be decoded; no customizable subscription. Poor flexibility. 
Two-phase commit. Because MySQL\u0026rsquo;s primary-standby replication heavily depends on binlog, binlog data must be fully flushed to binlog files at commit time. A single commit must write two (or two kinds of) logs — binlog and redolog. Dual log writes are one of MySQL\u0026rsquo;s eternal pain points. Oracle Logical Replication # （https://www.oracle-scn.com/oracle-goldengate-integrated-capture/）\nOracle itself does have logical Data Guard functionality, but virtually no one uses it. Here we\u0026rsquo;ll only discuss LogMiner. The Oracle database itself provides an interface like LogMiner for parsing logs (e.g., OGG integrated capture mode), but has zero replication link management itself — it relies on third-party tools to create and manage replication links.\nAdvantages of the Oracle approach:\nOnly provides a parsing interface, no replication link management. For the database itself, this is very hassle-free. Pay and you get a solution. Just buy the powerful OGG directly. Don\u0026rsquo;t say Oracle hasn\u0026rsquo;t provided a logical replication solution — we not only have one, it\u0026rsquo;s powerful and highly recognized. Disadvantages of the Oracle approach:\nRelies on third-party software to manage replication links. In summary, PG\u0026rsquo;s logical replication is an all-in-one, do-everything approach — very much in the open-source, technical spirit. 
MySQL\u0026rsquo;s approach is simple, crude, but effective — somewhat \u0026ldquo;one-step-to-finish.\u0026rdquo; Oracle\u0026rsquo;s approach is: provide an interface and leave everything else to third parties, but from the customer\u0026rsquo;s perspective, there is a mature solution available.\n","date":"Nov 30, 2025","externalUrl":null,"permalink":"/en/2025/11/30/a-brief-review-of-logical-replication-in-oracle-mysql-and-postgresql/","section":"Posts","summary":"PostgreSQL Logical Replication # ​​​​ （https://www.pgconf.asia/JA/2017/wp-content/uploads/sites/2/2017/12/D2-A7-EN.pdf）\nPostgreSQL places all logical decoding related matters entirely within the database’s replication slots for management — an all-inclusive approach. Early versions had somewhat limited logical replication support, but in recent major versions, logical replication has been one of the primary functional improvements.\nAdvantages of the PG approach:\nVery flexible: it exposes the logical decoding interface to users, with multiple types of decoding methods available. Users can subscribe to only the data they need based on their requirements. Disadvantages of the PG approach:\n","title":"A Brief Review of Logical Replication in Oracle, MySQL, and PostgreSQL","type":"posts"},{"content":"Paper: Unlocking the Potential of CXL for Disaggregated Memory in Cloud-Native Databases\nSIGMOD best paper: https://sigmod.org/sigmod-awards/sigmod-best-paper-award/\nCXL and PolarDB-CXL # What is CXL # CXL: An open industry standard, a high-speed interconnect specification formulated by the CXL Consortium (founded in 2019 by tech giants Intel, AMD, ARM, etc.). It represents the evolutionary direction of computing architecture. 
Currently at CXL 4.0.\nFeature CXL 1.0/1.1 CXL 2.0 CXL 3.0/3.1 CXL 4.0 (latest) Release March/Sept 2019 October 2020 August 2022 / November 2023 November 2025 Base Protocol PCIe 5.0 (32 GT/s) PCIe 5.0 (32 GT/s) PCIe 6.0 (64 GT/s) PCIe 7.0 (128 GT/s) Max Bandwidth 1TB/s 1TB/s 2TB/s 4TB/s+ Topology Scale Point-to-point / simple star Single switch (≤32 nodes) Multi-level Fabric (4096 nodes) Ultra-large-scale Fabric From my research, two descriptions of CXL left the deepest impression:\nMemory as a Service Near-memory computing and expansion CXL switch: A switching chip, physical hardware. Many vendors are working on industrial implementations. The paper specifically references products from XConn Tech: CXL 2.0 switch. Note that as of November 22, 2025, XConn only has CXL 2.0 switches, no 3.0 products. However, there are products on the market supporting 3.0+ standards, such as Panmnesia CXL 3.2 Fabric Switch.\nPolarCXLMem: According to the paper, \u0026ldquo;the first CXL-switch-based disaggregated memory system.\u0026rdquo; But the paper also states \u0026ldquo;we leverage the world\u0026rsquo;s first CXL switch[50]\u0026rdquo; — specifically referring to the XConn tech CXL 2.0 switch — and then says \u0026ldquo;PolarCXLMem is the first CXL-switch-based disaggregated memory.\u0026rdquo; This can be interpreted in two ways:\nThe first disaggregated memory system based on CXL switches The first disaggregated memory system based on XConn tech CXL 2.0 switches PolarDB-CXL: The paper doesn\u0026rsquo;t actually use this term, but the industry uses it. It represents \u0026ldquo;integrate PolarCXLMem into the multi-primary version of PolarDB, known as PolarDB-MP\u0026rdquo; — essentially \u0026ldquo;the CXL-upgraded version of PolarDB-MP.\u0026rdquo; The paper repeatedly uses lengthy phrases but never uses the term polardb-cxl. 
For convenience, this article uses polardb-cxl to represent its essential meaning.\nRDMA vs CXL # PolarDB-MP uses RDMA architecture, while PolarDB-CXL uses CXL architecture:\n(https://medium.com/@anan.mirji/cxl-switch-vs-rdma-a-technical-comparison-for-high-performance-interconnects-6aaa031cde31)\nRDMA architecture is a cross-host distributed interconnect architecture, while CXL architecture is a single-host expanded interconnect architecture.\nKey differences:\nDimension RDMA Architecture CXL Architecture Topology Multi-host + network switch distributed arch Single-host + CXL switch expanded arch Communication Network (InfiniBand/RoCE) PCIe bus (CXL based on PCIe physical layer) Core Components RDMA NIC (dedicated NIC) CXL Controller, CXL Switch Resource Ownership \u0026ldquo;Remote resources\u0026rdquo; across independent hosts \u0026ldquo;Expanded resources\u0026rdquo; within the host architecture CXL\u0026rsquo;s Advantages # CXL\u0026rsquo;s advantages over RDMA:\nLow latency: CXL connects to host or device memory via PCIe; RDMA requires protocol interface conversion between InfiniBand and PCIe.\nInstruction support: CXL provides native load/store instructions, allowing the CPU to directly manipulate remote CXL device memory as if it were local memory. RDMA requires reading from remote memory to local memory, processing locally, then writing back to remote memory.\nSimplified applications: RDMA requires special interfaces and drivers, needing professionals to design complex programs; CXL provides transparent memory space, greatly simplifying application design.\nMemory fusion: CXL 3.0 supports physical hardware-level memory pooling.\nProblems with PolarDB-MP and the value CXL provides:\nCXL\u0026rsquo;s critique of MP:\nMemory pages are 4-16K, so even when only a small amount of data transfer is needed, data must move between local and shared memory, causing read/write amplification. Maintaining local memory adds extra memory overhead, reducing throughput. 
Recovery is very time-consuming. RDMA is far better than TCP/IP, but under high concurrency, it suffers from \u0026ldquo;doorbell register implicit contention\u0026rdquo; and \u0026ldquo;cache thrashing\u0026rdquo; issues. The database itself must maintain shared memory. Benefits CXL brings:\nEliminates the \u0026ldquo;shared memory - local memory\u0026rdquo; hierarchical memory structure, also eliminating the maintenance overhead and read/write amplification. Because CXL load/store to local memory is fast enough, it allows directly storing all buffer pages. Uses cache lines (64B) as the minimum transfer unit between CPU cache and main memory, rather than PolarDB-MP\u0026rsquo;s 4K pages. Saves main memory. DRAM costs are very high, roughly 40-50% of server/rack costs. Simplifies system design. Minimal modifications to existing systems are important for commercial database stability. PolarRecv: An instant recovery system built on CXL. After a database crash, data and metadata remain on CXL, allowing direct reads of consistent state from CXL memory, so recovery is very fast. (This seems similar to how PG\u0026rsquo;s page cache helps fast startup after a crash.) DRAM vs RDMA vs CXL:\nWhen data volume is small, RDMA has significantly higher latency than CXL; with larger data, RDMA\u0026rsquo;s latency improves slightly. Local DRAM access is slightly better than CXL access.\nOverall, CXL memory access latency is slightly higher than DRAM but better than RDMA.\nRegarding CXL\u0026rsquo;s higher latency vs DRAM, the paper explains: \u0026ldquo;database buffer pool operations are more sensitive to bandwidth than latency\u0026rdquo; — for database memory, bandwidth matters more than latency.\nCustom Rack # Self-developed physical prototype rack. 
The left rack integrates two CXL switch-enabled clusters, each connected to memory devices and hosts; the right rack integrates one CXL switch connected to memory devices and hosts.\nPolarCXLMem # The CXL 2.0 switch supports memory pooling, but the drivers don\u0026rsquo;t fully support it, so PolarCXLMem still designed its own CXL memory allocation and usage — it\u0026rsquo;s not fully transparent. PolarCXLMem processes CXL memory into a multi-tenant model, with different host nodes allocated different CXL memory regions.\nPolarCXLMem characteristics:\nNodes have their own CXL memory regions; different nodes\u0026rsquo; CXL memory does not overlap. The buffer pool is allocated at database startup (by the CXL mem manager in the diagram) and does not change during runtime. The memory unit structure in CXL mem is a block, which stores page data and page metadata, including: id (page id), lock state (whether the page is locked for update), prev/next (LRU doubly-linked list), lsn (latest log sequence number of the page). Free list / in-use list is used for LRU. Question: PG\u0026rsquo;s page header has lsn, starting free space pointer, prune xid, etc. What does PolarDB-CXL\u0026rsquo;s page header structure look like?\nPolarRecv # PolarDB-MP was designed based on RDMA, where data pages are written locally, and the disaggregated shared memory doesn\u0026rsquo;t contain the latest version of data pages. This means after a host crash, you must scan and apply all redo log files (the paper says redo, not WAL) or pages from a small amount of shared memory.\nCXL switches have independent power, so even if the host crashes, the latest data remains in CXL memory. 
PolarRecv leverages this to dramatically speed up database recovery after host crashes.\nHowever, while CXL switch memory is transparent and persistent, directly using it after a crash still requires handling these issues:\nLRU lists may be inconsistent at crash time B-tree SMO (B-tree structure changes), such as index splits, may be inconsistent at crash time Pages being updated at crash time may be inconsistent The redo log buffer uses local DRAM. When the redo log hasn\u0026rsquo;t been flushed to disk at crash time, the page LSN in the CXL buffer pool may be greater than the LSN in the redo log file, directly violating the ARIES principle PolarRecv\u0026rsquo;s design strategies:\nUse a mutex to protect the LRU structure. The mutex lock state indicates whether the LRU was being modified at crash time. If so, the LRU must be rebuilt; if not, use the LRU directly from CXL memory. During B-tree SMO, a mini-transaction protects index pages. This mini-transaction is a two-phase lock corresponding to page locks. It\u0026rsquo;s only flushed to the redo log when the mini-transaction commits. So during recovery, if an index page is found with a write lock, recover from the redo logs. PolarCXL\u0026rsquo;s read/write locks are stored in CXL memory. If a write lock still exists, it means the update was in an intermediate state at crash time and not completed. In this case, rebuild the page from the redo log file rather than reading an inconsistent page from CXL memory. During recovery, first obtain the maximum LSN from the redo log, then check the lock and LSN of pages in CXL memory. If a page\u0026rsquo;s LSN in CXL memory is greater than the max LSN, rebuild the page using redo log information rather than using the CXL memory version. Memory Fusion # Because PolarCXLMem is designed based on the CXL 2.0 switch, and hardware-level memory fusion only arrives with CXL 3.0, the database layer still has to design its own memory fusion. 
Since each node\u0026rsquo;s buffer pool is placed in isolation in PolarCXLMem, CXL 2.0\u0026rsquo;s memory fusion is achieved through DBP metadata management — each buffer pool only stores the page\u0026rsquo;s CXL memory address, not the page itself.\nTo understand this diagram, you need to distinguish between CXL memory, DBP, and local buffer:\nCXL memory is the physical hardware, CXL mem itself. DBP is a region carved out of CXL for managing memory fusion services. Local metadata buffer contains local buffer metadata and part of CXL. Also understand that for each page in the buffer pool, there are two flags:\ninvalid: After another node writes to the page, the current node needs to invalidate its local CPU cache. removal: When a page moves from the in-use list to the free list, all nodes must set the removal flag. Memory fusion page access flow:\nThe requested page is not in the local page metadata buffer: 1.1 Allocate a new meta record from the free list, and provide invalid and removal addresses to the memory fusion service via RPC. The requested page is in the local page metadata buffer: 2.1 First check the removal flag. If removal is set, it means the memory fusion service has already reclaimed the page, and a new memory address must be requested from the memory fusion service via RPC. 2.2 Then check the invalid flag. If invalid is set, it means the page has been modified by another node, and the CPU cache must be invalidated to ensure consistency.\nFusion consistency:\nSince CXL 2.0 doesn\u0026rsquo;t have memory fusion, CPU caches aren\u0026rsquo;t automatically updated. PolarCXL implements multi-node concurrent write control through page-level locks.\nNodes must acquire read/write locks to read/write pages. When one node is writing to a page, other nodes cannot read or write that page. After a node finishes writing, it must also:\nFlush the CPU cache to CXL mem (cache line flush) to ensure CXL mem has the latest page version. 
Set the invalid flag to ensure other nodes don\u0026rsquo;t read stale page versions from their CPU caches. Memory fusion summary:\nCXL 2.0 itself supports incomplete memory fusion, meaning the database layer still needs to design a memory fusion scheme. Memory pages are accessed via CXL addresses, rather than local/remote access to entire pages as in the RDMA approach. The local CPU cache needs the database layer to flush it to ensure node data access consistency — this is a hard limitation. This also means cross-node updates still use exclusive page-level locks (the RDMA approach also uses exclusive page-level locks).\nPerformance Evaluation # Multi-Node Read/Write # Benchmarking with 12 instances on a 192 vCPU host, comparing RDMA (PolarDB-MP) vs CXL (PolarDB-MP with PolarCXLMem) performance:\nPoint queries:\nRange queries:\nRead-write:\nPoint queries: Read amplification is most severe for point queries. CXL\u0026rsquo;s bandwidth consumption is 3-4x lower than RDMA. When reaching 3 nodes, RDMA bandwidth is already saturated — adding more nodes doesn\u0026rsquo;t improve bandwidth. Range queries: Read amplification is less severe. Only at \u0026gt;4 nodes does it reach the bandwidth ceiling of 11GB/s, while CXL can still scale linearly with nodes. Read-write: Performance is similar to range queries, just with smaller differences. PolarRecv Recovery Time # vanilla: Refers to the general approach, probably similar to PG reading from local cache or disk (possibly polar redo). RDMA-based: Refers to PolarDB-MP where some data can be read from disaggregated shared storage. PolarRecv: Refers to continuing to read most data from CXL, with only a small amount of partial pages needing recovery from redo files. The paper discusses recovery time in 2 phases: startup/recovery and reaching pre-crash load levels. Read-only doesn\u0026rsquo;t need recovery — as long as there\u0026rsquo;s data, you can start and take load. 
When writes exist, recovery is needed, and the advantage of continuing to read from CXL memory becomes apparent. The difference between 1-minute, 2-minute, and 4-minute recovery times is significant: it could be the difference between an outage the business barely notices and one that visibly impacts it.\nShared Data Updates # The main battleground for distributed database performance is updates to shared data. After PolarDB-MP crushed Taurus-MM, PolarDB-CXL also crushed PolarDB-MP:\nAt 0% shared data, the RDMA-based solution just accesses local buffers, and PolarDB-CXL just treats CXL as a memory pool. Even so, CXL-based still performs better, mainly due to the read/write amplification and bandwidth ceiling issues of the RDMA-based solution mentioned earlier.\nFrom the performance comparison chart above, it\u0026rsquo;s clear that PolarDB-CXL significantly outperforms PolarDB-MP. However, note that when shared data \u0026gt;60%, PolarDB-CXL\u0026rsquo;s performance improvement becomes less significant, mainly because:\nPage-level locks become the bottleneck. As lock contention intensifies, processes enter sleep states, and frequent context switching further exacerbates resource contention. Summary # PolarDB-CXL advantages:\nEliminates RDMA\u0026rsquo;s \u0026ldquo;local-remote\u0026rdquo; hierarchical memory structure design. Resolves RDMA\u0026rsquo;s read/write amplification problem. Provides a CXL-based memory pool. PolarRecv, based on CXL persistent memory, enables faster database crash recovery. Benchmarking shows PolarDB-MP CXL outperforms PolarDB-MP RDMA. PolarDB-CXL disadvantages:\nCross-node updates still use page-level locks, which remain the main performance bottleneck in shared data update scenarios. The CXL 2.0 switch seems a bit dated: by the time the paper was published, switch devices supporting 3.2 were already available, and CXL 4.0 was announced in November 2025. 
We can predict future databases built on newer CXL standard switch devices. The paper quality isn\u0026rsquo;t actually as high as the MP paper — it mainly revolves around solutions for the CXL 2.0 switch physical hardware, which differs from the extensive database-layer design found in the PolarDB-MP paper. Original link: https://lastdba.com/2025/11/30/论文精读polar-db-cxl2025-sigmod最佳工业论文/\n","date":"Nov 30, 2025","externalUrl":null,"permalink":"/en/2025/11/30/cxl-and-polardb-cxl/","section":"Posts","summary":"Paper: Unlocking the Potential of CXL for Disaggregated Memory in Cloud-Native Databases\nSIGMOD best paper: https://sigmod.org/sigmod-awards/sigmod-best-paper-award/\nCXL and PolarDB-CXL # What is CXL # CXL: An open industry standard, a high-speed interconnect specification formulated by the CXL Consortium (founded in 2019 by tech giants Intel, AMD, ARM, etc.). It represents the evolutionary direction of computing architecture. Currently at CXL 4.0.\n","title":"CXL and PolarDB-CXL","type":"posts"},{"content":"Paper: PolarDB-MP: A Multi-Primary Cloud-Native Database via Disaggregated Shared Memory\nSIGMOD best paper: https://sigmod.org/sigmod-awards/sigmod-best-paper-award/\nForeword and Abstract # The paper opens with the problem: primary-replica architecture\u0026rsquo;s write throughput is limited by the primary. Shared-nothing architecture offers scalable multi-primary clusters that can solve the single-primary limitation, but this architecture suffers performance bottlenecks due to distributed transaction overhead. 
Recently, shared-storage-based cloud-native multi-primary databases have emerged, but under high-conflict scenarios, they face high conflict resolution costs and low data fusion efficiency.\nSo the problem is: single-primary primary-replica, shared-nothing, and shared-storage cloud-native multi-primary architectures all have their own issues.\nThis paper proposes PolarDB-MP, a novel multi-primary cloud-native database combining disaggregated shared memory with shared storage. (Since multi-primary cloud-native databases already exist, it needs to be \u0026ldquo;novel.\u0026rdquo;)\nPolarDB-MP\u0026rsquo;s basic characteristics:\nAll nodes can equally access all data, allowing transactions to be processed independently on a single node, without traditional distributed transaction mechanisms. Shared storage: PolarStore and PolarFS, or other compatible shared storage solutions. Built on disaggregated shared memory. Low-latency communication via RDMA (Remote Direct Memory Access). LLSN (Local Logical Sequence Number): Used to establish partial order for WAL logs generated by different nodes, combined with custom recovery strategies to ensure consistency and efficiency during abnormal recovery. 
Core component PMFS (Polar Multi-Primary Fusion Server) responsible for: Transaction Fusion — transaction ordering and visibility management Buffer Fusion — distributed shared buffer mechanism Lock Fusion — cross-node concurrency control Classification # The classification is mainly to understand PolarDB-MP\u0026rsquo;s historical position and the \u0026ldquo;first\u0026rdquo; qualifier:\nPolarDB-MP is the first multi-primary cloud-native database that utilizes disaggregated shared memory and shared storage for transaction coordination and buffer fusion\nCompetitor Weaknesses # Shared-nothing products: The paper doesn\u0026rsquo;t call out individual products, just one line: transactions accessing across multiple partitions require significant additional overhead for distributed transactions.\nOracle:\nExpensive distributed lock management Expensive network overhead Reliance on sophisticated hardware (alien tech) Difficult to migrate to cloud, or higher TCO (including maintenance and labor costs) compared to cloud-native databases after migration AWS Aurora-MM:\nUses optimistic transaction model; high transaction abort rates under conflicts In some scenarios, 4-node throughput is lower than single-node Huawei Taurus-MM:\nPessimistic transaction model. Relies on page storage and log replay to ensure cache consistency, with high overhead in concurrency control and data synchronization. 
Under a read-write workload with 50% shared data, 8 nodes achieve only 1.8x single-node throughput The Oracle critique here is mainly plausible-sounding trash talk, while the Aurora-MM and Taurus-MM claims cite the original vendors:\nAurora-MM \u0026ldquo;in some scenarios, 4-node throughput is lower than single-node\u0026rdquo; Taurus-MM \u0026ldquo;under a read-write workload with 50% shared data, 8 nodes achieve only 1.8x single-node throughput\u0026rdquo; Transaction Fusion # Transaction Fusion Overview # How does multi-primary ensure consistent data views?\nSnapshot isolation is a common MVCC implementation. A characteristic of snapshot isolation is that queries or transactions must maintain their consistent data view during execution. But in multi-primary architecture, local nodes cannot guarantee consistent data views due to remote data updates.\nTo solve this, general multi-primary shared-storage architectures introduce global transaction mechanisms (Aurora-MM or Taurus-MM). PolarDB-MP introduces an innovative technique: transaction fusion within PMFS. Each node only maintains local transaction information, which can be accessed by other nodes via RDMA. In contrast to global transactions, transaction fusion is decentralized.\nLocal Transactions and TIT Table # Each node in PolarDB-MP maintains a small amount of memory to store local transaction information (accessible by other nodes via RDMA). 
This local transaction information is stored in the transaction Information Table (TIT).\nTIT table contents:\nTransaction object pointer Commit timestamp (CTS) assigned by the global timestamp coordinator (TSO) version, representing different transactions in the same slot ref, indicating whether this transaction is being waited on by other transactions for lock release (probably PLock or RLock) How Transactions Proceed # When a transaction begins, a local transaction id (presumably txid) is assigned, and the TIT slot stores the transaction object pointer, ref initialized to 0, and CTS initialized to CSN_INIT.\nPolarDB-MP uses a global transaction ID to identify a transaction: global transaction ID = (node_id, trx_id, slot_id, version). The global transaction ID does not include CTS. To know the commit order of transactions, such as when constructing a transaction visibility view, you need to go through the global transaction ID, via RDMA, to the target node to find CTS (similar to PG\u0026rsquo;s pg_xact_commit_timestamp() function, which finds the corresponding transaction commit time from local files using the transaction id).\nIf trx_id is the transaction ID in PG, then node_id + trx_id can identify the global uniqueness of a transaction, or node_id + slot_id + version could also work to some extent (when slot id is not reused, e.g., at a given moment it uniquely identifies a transaction). Of course, the extra information combined is also unique. After all, this information is key to PolarDB-MP\u0026rsquo;s transaction fusion implementation.\nEach transaction constructs a visibility view using the global transaction ID and CTS. 
The visibility view concept is consistent with PG: the current read view can read data rows committed before the read view, and the latest version rows.\nAccessing Remote CTS # Since CTS is local (in TIT or on the local filesystem), obtaining the reading transaction\u0026rsquo;s CTS is an interesting task:\n1.1 If a row\u0026rsquo;s CTS is CSN_INIT/CTS_INIT, meaning the transaction is still active, return the maximum CTS to indicate it\u0026rsquo;s invisible to all transactions except itself.\n1.2 If a row\u0026rsquo;s CTS is not CSN_INIT/CTS_INIT, meaning the transaction has committed, and it\u0026rsquo;s in the local TIT, directly return CTS.\n2. If a row has no CTS, obtain CTS via the row\u0026rsquo;s g_trx_id.\n2.1 If the transaction belongs to the local node (g_trx_id has node id), read from local filesystem to local TIT.\n2.2 If the transaction doesn\u0026rsquo;t belong to the local node, read from remote filesystem to remote TIT via RDMA.\n3.1 If slot.version != g_trx_id.version, the slot has been reused, so the transaction must have committed and the row is definitely visible to all transactions. Return minimum CTS to indicate visibility to all transactions.\n3.2 If slot.version = g_trx_id.version, refer to 1.1, 1.2.\nPolarDB-MP\u0026rsquo;s transaction visibility concept is very similar to PG\u0026rsquo;s, except PG uses txid instead of CTS to indicate transaction ordering and doesn\u0026rsquo;t need to consider remote access.\nRow Update Transactions # Additionally, row updates are also very similar:\nWhen PolarDB-MP updates a row, besides updating the data itself, it must also:\nUpdate the row\u0026rsquo;s global transaction ID (g_trx_id) (if it\u0026rsquo;s an in-row update, then it modifies PG\u0026rsquo;s row header). Update the row\u0026rsquo;s CTS. (The paper doesn\u0026rsquo;t specify whether this is in the row header or filesystem. If similar to PG, it should be in the commit_ts directory on the filesystem. Not confirmed for PolarDB.) 
Questions About Transaction Fusion (Things I Didn\u0026rsquo;t Understand) # g_trx_id is row metadata written to disk. If nodes are added or removed, does the node_id in the data row\u0026rsquo;s g_trx_id need updating? If not, which node should the row be loaded into when read next time?\nA new row\u0026rsquo;s CTS is stored on local node A. If another node B updates this row, is the new CTS on node A or B?\n\u0026ldquo;assigned a read view, which consists of its own g_trx_id and the current CTS.\u0026rdquo; Do read-only transactions also get assigned a g_trx_id when constructing a read view?\nWithout a doubt, a parameter like track_commit_timestamp must be forcibly enabled.\nIf there are many writes on node A and reads on node B, B\u0026rsquo;s reads will access A\u0026rsquo;s TIT data via RDMA — does this generate significant network IO? Should this be considered when designing read-write separation or multi-node reads and writes? The original paper might answer this — \u0026ldquo;Multi-primary architectures inherently require synchronizing large amounts of data and messages between nodes to support concurrent access across multiple nodes. As network technology develops (InfiniBand, RDMA) and achieves commercial deployment, the network bottleneck becomes less significant.\u0026rdquo;\nGlobal timestamps could become a bottleneck in distributed systems. PolarDB-SCC is a shared-storage-based timestamp solution that appears to perform well. Due to time constraints, I\u0026rsquo;ll set this aside for now.\nBuffer Fusion # Buffer Fusion Introduction # Each node in PolarDB-MP can update any data page, leading to substantial data transfer. Buffer Fusion\u0026rsquo;s distributed buffer pool (DBP) is designed to solve this problem. 
Each node has a local buffer pool (LBP), which is a subset of DBP.\nHow Buffer Fusion Works # LBP has two new metadata items for pages:\nvalid: whether the local copy is still current (it becomes invalid once another node has updated the page) r_addr: pointer to the page in DBP When accessing a page from LBP, the current node must first check if the page is valid. If invalid, it must access DBP via r_addr. After DBP stores a new version of the page, buffer fusion invalidates the copies on all other nodes. In LBP, dirty pages are periodically flushed to DBP in the background, or pushed to DBP before the PLock is released.\nPage access steps:\n1.1 If the page is in LBP and valid, access directly. 1.2 If the page is in LBP and invalid, access DBP via RDMA. 2. If the page is in neither LBP nor DBP, read from shared storage. 3. The page is loaded from a node into LBP and registered in DBP.\nThe key component of PolarDB-MP\u0026rsquo;s buffer fusion is disaggregated shared memory. It appears to be one or more physical hardware devices, or an integrated component built on top of them, separate from compute nodes. This differs significantly from memory in traditional distributed systems.\nIt\u0026rsquo;s also different from transaction fusion: transaction fusion requires accessing remote nodes with the same architecture, while buffer fusion doesn\u0026rsquo;t; it accesses the separate disaggregated shared memory component instead.\nQuestions About Buffer Fusion (Things I Didn\u0026rsquo;t Understand) # Disaggregated shared memory seems like a component separate from standard hosts, so what exactly is it?\nLock Fusion # Lock Types in Lock Fusion # Buffer fusion solves how nodes access remote data; lock fusion solves concurrent access control.\nLock fusion has two types of locks:\npage-locking (PLock): Similar to latches, controlling atomic access and internal structure consistency. Single-node page access doesn\u0026rsquo;t use PLock. 
row-locking (RLock): Responsible for cross-node transaction control, following the two-phase locking protocol. PLock Access Flow # (The paper doesn\u0026rsquo;t say where lock fusion occurs. Since PLock is a page-level latch and buffer fusion happens on shared memory, I\u0026rsquo;ll assume lock fusion also occurs on shared memory, as this is easier to understand.)\nBefore updating/reading a page, the local lock manager checks whether the local node already holds the corresponding X/S PLock (or higher-level lock). 1.1 If yes, execute in place. 1.2 If no, acquire PLock through Lock Fusion. Lock fusion checks for conflicts before responding; if a conflict exists, the request waits. When PLock is released by a node, it notifies Lock Fusion, which updates PLock\u0026rsquo;s state and notifies other nodes to continue their operations. PLock Lazy Releasing # According to the PLock access flow above, a PLock is immediately released after local operations complete. This may not be optimal, given temporal locality: \u0026ldquo;a data item or instruction accessed at a given time is likely to be accessed again in the near future.\u0026rdquo; Lazy releasing reduces the RPC load of PLock requests.\nThe principle is simple: PLock is not immediately released after use on the local node; it\u0026rsquo;s only released when ref reaches 0.\nWhen other nodes need PLock, Lock Fusion also sends negotiation messages to intervene when the local node is holding the lock; the local node must communicate with Lock Fusion rather than autonomously handling PLock. Lock Fusion uses a \u0026ldquo;first-in-first-out\u0026rdquo; strategy to resolve cross-node lock ownership, again until the local node\u0026rsquo;s ref = 0, at which point other nodes can acquire the lock.\nLazy releasing is an effective distributed lock solution, balancing local lock optimization with global lock allocation.\nRLock Overview # RLock uses the global transaction ID to determine lock status (similar to PG). 
According to the transaction fusion content, the global transaction ID contains node id, transaction id, slot id, version. So when a local node reads a row, it can directly obtain the lock information on the row, know where the lock is (node id), and know if the lock is active.\nThere are two interesting points about determining transaction activity:\nFrom the transaction fusion flow of accessing remote CTS: if the transaction\u0026rsquo;s CTS is a valid value, or the transaction is in the same slot in TIT but not the same version, the transaction has definitely committed, so no need to check activity. If the source transaction is not active, there\u0026rsquo;s no need to wait for locks — proceed directly. PG has the concept of a minimum active transaction ID, which also exists in PolarDB-MP. If the transaction ID on the row is less than the global minimum active transaction ID, the source transaction must have also committed (or rolled back). How RLock Works # Local rows are handled locally; only conflicts are processed in Lock Fusion; cross-node row locks require RLock. \u0026ldquo;The transaction ID in the row functions as a lock indicator. So this protocol only supports exclusive (X) lock. The shared (S) lock on a row is not supported in PolarDB-MP, but it\u0026rsquo;s acceptable.\u0026rdquo; Only truly conflicting exclusive locks need RLock; shared locks don\u0026rsquo;t need RLock.\nT30 reads the row from shared storage and can determine from the row\u0026rsquo;s metadata (g_trx_id) that the transaction is active and which node it\u0026rsquo;s on. T30 remotely adjusts T10\u0026rsquo;s transaction ref. T30 sends a wait status to the Lock Fusion service. Lock Fusion adds wait information to the wait info table. T10 finishes execution and notifies Lock Fusion. Lock Fusion checks the wait info table, then notifies T30 it can continue. 
Questions About Lock Fusion (Things I Didn\u0026rsquo;t Understand) # \u0026ldquo;when attempting to update a row, it must already hold an X PLock lock on the page containing the row\u0026rdquo;\nUpdating also requires holding an exclusive PLock on the page, meaning updates on the same page block each other — doesn\u0026rsquo;t this affect concurrency? Locally, there shouldn\u0026rsquo;t be such behavior; PG doesn\u0026rsquo;t have page-exclusive locks for update scenarios.\nIn the \u0026ldquo;Logs ordering and recovery\u0026rdquo; chapter, there are two statements: \u0026ldquo;Thanks to the PLock design, only one transaction can update a page at a time\u0026rdquo; and \u0026ldquo;When a page is updated across two nodes, one node pushes its updated page to the DBP before releasing the PLock, allowing the next node to retrieve it from the DBP.\u0026rdquo;\nYes, during cross-node data updates, there are page-level exclusive locks.\nPMFS Summary (Hot Take) # PMFS (Polar Multi-Primary Fusion Server) is the core component implementing PolarDB-MP\u0026rsquo;s multi-primary distributed system. Among its features, the global transaction ID design is ingenious — it transforms PG\u0026rsquo;s transaction ID into one containing node information, transaction id, and transaction fusion\u0026rsquo;s slot and version information, placed in the row header. This has several benefits:\nDirectly accessing a row reveals the row\u0026rsquo;s version ordering. Directly accessing a row reveals which node updated it. Directly accessing a row reveals whether cross-node locks may exist. Uses minimum active transactions to reduce conflict determination. Uses global transaction ID information to achieve distributed retrieval of transaction commit timestamps (CTS). Additionally:\nBuffer fusion and lock fusion in PMFS appear highly dependent on the shared memory component. RDMA is omnipresent throughout. 
Log Ordering # Partial Order # First, WAL is generated on each node without any concurrency control mechanism — each writes independently to shared storage. Each node\u0026rsquo;s LSN is sequential for that node, but across multiple nodes, WAL records don\u0026rsquo;t exhibit global ordering.\nBut is global ordering needed when writing WAL records?\nFrom the paper, most of the time it\u0026rsquo;s not needed.\nOnly one case requires guaranteed global ordering during writing: cross-node updates to the same page.\nHowever, according to the PMFS lock fusion mechanism, cross-node updates to the same page are exclusive. Lock fusion can ensure the ordering of cross-node page updates.\nRecovery Ordering # Since LLSNs from cross-node writes come from multiple nodes and are likely not in order, recovery needs to be done in order. Reading all WAL records and sorting by LLSN is a simple approach, but massive sorting is very resource-intensive.\nPolarDB-MP proposes segment-wise sorting of LLSN — each segment is called a chunk, with chunk boundaries called LLSN bounds. PolarDB-MP can guarantee that an LLSN bound is always less than the next bound, then sort LLSNs within each chunk.\nQuestions About Log Ordering (Things I Didn\u0026rsquo;t Understand) # \u0026ldquo;utilizing redo (write-ahead) logs for data recovery and undo logs for rolling back uncommitted changes\u0026rdquo;\nPolarDB-MP has undo log files? What is this undo for?\nI didn\u0026rsquo;t see anything particularly special about LLSN; the paper doesn\u0026rsquo;t detail its structure. LSN seems sufficient — maybe there are differences regarding global transaction IDs.\nEvaluation # Read-only operations are all local, so adding nodes linearly increases throughput. 
If read-write/write-only data is well-partitioned and doesn\u0026rsquo;t cross nodes, it\u0026rsquo;s also nearly linear.\nThe problem lies in shared data across read-write/write-only nodes, which is the ultimate test of distributed database performance.\nThe paper directly compares against Huawei\u0026rsquo;s Taurus-MM. The conclusion: PolarDB-MP\u0026rsquo;s cross-node write performance is indeed significantly better.\nNitpicking # The paper mentions Taurus-MM\u0026rsquo;s performance improvement under 8-node shared data in two places, but the data is inconsistent:\nThe eight-node cluster only improves the throughput by 1.8× compared to the single-node version in the read-write workload with 50% shared data.\nthe throughput of Taurus-MM\u0026rsquo;s eight-node cluster is approximately 1.8× that of a single node under the SysBench write-only workload with 30% shared data, illustrating the trade-offs and challenges in optimizing multi-primary cloud databases\nSometimes 30% shared data, sometimes 50% — not very rigorous. The original Taurus MM paper says 50%:\nSummary # Not much to summarize — see the Foreword and Abstract and PMFS Summary sections.\nOriginal link: https://lastdba.com/2025/11/30/论文精读polar-db-mp2024-sigmod最佳工业论文/\n","date":"Nov 30, 2025","externalUrl":null,"permalink":"/en/2025/11/30/paper-deep-read-polardb-mp-2024-sigmod-best-industrial-paper/","section":"Posts","summary":"Paper: PolarDB-MP: A Multi-Primary Cloud-Native Database via Disaggregated Shared Memory\nSIGMOD best paper: https://sigmod.org/sigmod-awards/sigmod-best-paper-award/\nForeword and Abstract # The paper opens with the problem: primary-replica architecture’s write throughput is limited by the primary. Shared-nothing architecture offers scalable multi-primary clusters that can solve the single-primary limitation, but this architecture suffers performance bottlenecks due to distributed transaction overhead. 
Recently, shared-storage-based cloud-native multi-primary databases have emerged, but under high-conflict scenarios, they face high conflict resolution costs and low data fusion efficiency.\n","title":"Paper Deep Read: PolarDB-MP | 2024 SIGMOD Best Industrial Paper","type":"posts"},{"content":" Problem Description # The n_distinct statistic was severely inaccurate.\nThis problem appeared across multiple databases. For example:\nA table with 200 million rows and a true DISTINCT count of 8 million had a statistics DISTINCT value of only 40,000.\nAnalysis # Sampling Model # The default default_statistics_target=100 means 30,000 rows are sampled from 30,000 pages.\nanalyze verbose tablzl1; INFO: 00000: analyzing \u0026#34;public.tablzl1\u0026#34; LOCATION: do_analyze_rel, analyze.c:332 INFO: 00000: \u0026#34;tablzl1\u0026#34;: scanned 30000 of 22963751 pages, containing 1061942 live rows and 3953 dead rows; 30000 rows in sample, 812872389 estimated total rows LOCATION: acquire_sample_rows, analyze.c:1340 Note \u0026ldquo;scanned 30000\u0026rdquo; and \u0026ldquo;30000 rows in sample\u0026rdquo;.\nDISTINCT Estimation Algorithm # The DISTINCT estimation algorithm in analyze.c:\n/*---------- * Estimate the number of distinct values using the estimator * proposed by Haas and Stokes in IBM Research Report RJ 10025: *\tn*d / (n - f1 + f1*n/N) * where f1 is the number of distinct values that occurred * exactly once in our sample of n rows (from a total of N), * and d is the total number of distinct values in the sample. * This is their Duj1 estimator; the other estimators they * recommend are considerably more complex, and are numerically * very unstable when n is much smaller than N. * * In this calculation, we consider only non-nulls. 
We used to * include rows with null values in the n and N counts, but that * leads to inaccurate answers in columns with many nulls, and * it\u0026#39;s intuitively bogus anyway considering the desired result is * the number of distinct non-null values. * * We assume (not very reliably!) that all the multiply-occurring * values are reflected in the final track[] list, and the other * nonnull values all appeared but once. (XXX this usually * results in a drastic overestimate of ndistinct. Can we do * any better?) *---------- */ int\tf1 = nonnull_cnt - summultiple; int\td = f1 + nmultiple; double\tn = samplerows - null_cnt; double\tN = totalrows * (1.0 - stats-\u0026gt;stanullfrac); double\tstadistinct; n*d / (n - f1 + f1*n/N)\nn = number of sample rows (rows scanned) d = number of distinct values found in the sample f1 = number of values appearing exactly once in the sample N = total number of rows in the table Algorithm paper: https://hugepdf.com/download/download-extended-version-of-this-paper_pdf\nThe paper is rather dense, so let\u0026rsquo;s work through some assumptions to understand this DISTINCT algorithm:\nAssume all values appear exactly once, and the table is large (n ≪ N), so f1 = d = n: n*d / (n - f1 + f1*n/N) = n*n / (n*n/N) = N, i.e. every row is estimated to be distinct. PostgreSQL special-cases a sample with no repeated values and stores n_distinct as -1 (all non-null rows distinct).\nAssume all values appear exactly once, and the table is small (n = N), so f1 = d and n/N = 1: n*d / (n - d + d*1) = d, the sampled distinct count, which equals the number of sampled rows.\nAssume no values appear exactly once in the sample, i.e., f1 = 0: n*d / (n - f1 + f1*n/N) = n*d / n = d, just the distinct count in the sample.\nIf a column is populated by inserting several rows of the same value, then several rows of another value, like:\n11, 2, 2, 2, 2, 3, 3, 3, \u0026hellip;\n3.1 Small table, all 30,000 rows sampled, true distinct = 10,000 (assumed): estimated distinct = d = 10,000\n3.2 Large table, sample contains both repeating values and singletons (some repeating values only have one row 
captured), i.e., n = 30,000, n/N ≈ 0\nn*d / (n - f1 + f1*n/N) = n*d / (n - f1) = 30000*d/(30000-f1) — the larger the distinct count in the sample, the larger the estimated distinct; the larger the number of singletons, the larger the estimated distinct.\nSummary:\nDISTINCT estimation is directly related to the distinct count and singleton count in the sample If the singleton count = 0, then larger samples yield larger estimated distinct values Verification # Since the default maximum sample size is 30,000 rows, for tables larger than this, the estimator is likely to underestimate DISTINCT. Note: the data should not have too many unique values.\nTesting a table with different sample sizes:\nTable: reltuples = 800 million, relpages = 20 million, size = 175GB, true column distinct = 100 million\ntarget statistics pages sampling ratio (approx) tuples sampling ratio (approx) n_distinct execution time 50 0.00075 0.00001875 60k 2s 100 0.0015 0.0000375 110k 5s 1000 0.015 0.000375 1.03M 58s 3000 0.045 0.001125 2.68M 3min 1s 10000 0.15 0.00375 6.75M 7min 21s (maximum target statistics is 10000)\nA rough conclusion: n_distinct and ANALYZE execution time grow proportionally with the sample size.\nn_distinct grows with sample size, while pages and tuples estimates remain consistently accurate.\nSolution # For extremely large tables, consider partitioning or optimizing based on actual SQL patterns.\nYou can also adjust the statistics target. The default default_statistics_target=100 means 30,000 rows from 30,000 pages.\nTemporary fix:\nset default_statistics_target=3000; analyze tab1; Long-term fix:\nalter table tab1 alter column col1 set STATISTICS 3000; Notes:\nColumn-level statistics target has the highest priority, overriding default_statistics_target Maximum statistics target is 10000 The table\u0026rsquo;s sampling target is determined by the maximum column target: /* * Determine how many rows we need to sample, using the worst case from * all analyzable columns. 
We use a lower bound of 100 rows to avoid * possible overflow in Vitter\u0026#39;s algorithm. (Note: that will also be the * target in the corner case where there are no analyzable columns.) */ targrows = 100; for (i = 0; i \u0026lt; attr_cnt; i++) { if (targrows \u0026lt; vacattrstats[i]-\u0026gt;minrows) targrows = vacattrstats[i]-\u0026gt;minrows; } for (ind = 0; ind \u0026lt; nindexes; ind++) { AnlIndexData *thisdata = \u0026amp;indexdata[ind]; for (i = 0; i \u0026lt; thisdata-\u0026gt;attr_cnt; i++) { if (targrows \u0026lt; thisdata-\u0026gt;vacattrstats[i]-\u0026gt;minrows) targrows = thisdata-\u0026gt;vacattrstats[i]-\u0026gt;minrows; } } If ANALYZE collects more or fewer rows than expected, check pg_attribute for per-column attstattarget settings:\nselect attrelid::regclass,attname,attstattarget from pg_attribute where attrelid = \u0026#39;tab1\u0026#39;::regclass and attstattarget not in (-1,0); Summary # For large tables where columns are non-unique but have high distinct counts (a realistic scenario), the sampling algorithm underestimates the DISTINCT value, and the smaller the sampling ratio, the worse the underestimate. The default sampling ratio is too small for large tables. You can increase it, but even the maximum is not that large.\n","date":"Oct 19, 2025","externalUrl":null,"permalink":"/en/2025/10/19/case-from-inaccurate-distinct-to-the-principles-of-distinct-estimation/","section":"Posts","summary":"Problem Description # The n_distinct statistic was severely inaccurate.\nThis problem appeared across multiple databases. For example:\nA table with 200 million rows and a true DISTINCT count of 8 million had a statistics DISTINCT value of only 40,000.\nAnalysis # Sampling Model # ","title":"Case: From Inaccurate DISTINCT to the Principles of DISTINCT Estimation","type":"posts"},{"content":" Problem Description # An index was added the night before, and the next morning the CPU was maxed out. The problematic SQL was easy to locate: just one query. 
The SQL was running for over 30 seconds, but the day before it only took about 3 seconds, so we needed to examine the before-and-after execution plan changes.\nOnly the key parts of the execution plan are shown below.\nExecution plan before adding the index:\n-\u0026gt; Nested Loop (cost=19.92..2259694.20 rows=265822 width=33) -\u0026gt; Index Scan using uk_lzl_task on lzl_task t (cost=0.29..20007.99 rows=195 width=24) Filter: ((created_by)::text = \u0026#39;LIUZHILONG62\u0026#39;::text) -\u0026gt; Append (cost=19.63..11337.15 rows=14842 width=57) -\u0026gt; Bitmap Heap Scan on lzl_202501 cc_1 (cost=19.63..3053.69 rows=1467 width=66) Recheck Cond: ((task_no)::text = (t.task_no)::text) Filter: ((created_date \u0026gt; \u0026#39;2025-01-07 09:00:00\u0026#39;::timestamp without time zone) AND (created_date \u0026lt; \u0026#39;2025-09-03 12:56:44.973\u0026#39;::timestamp without time zone)) -\u0026gt; Bitmap Index Scan on lzl_202501_task_no_idx (cost=0.00..19.27 rows=1594 width=0) Index Cond: ((task_no)::text = (t.task_no)::text) -\u0026gt; Bitmap Heap Scan on lzl_202502 cc_2 (cost=21.67..3066.85 rows=1604 width=66) Recheck Cond: ((task_no)::text = (t.task_no)::text) Filter: ((created_date \u0026gt; \u0026#39;2025-01-07 09:00:00\u0026#39;::timestamp without time zone) AND (created_date \u0026lt; \u0026#39;2025-09-03 12:56:44.973\u0026#39;::timestamp without time zone)) -\u0026gt; Bitmap Index Scan on lzl_202502_task_no_idx (cost=0.00..21.27 rows=1605 width=0) Index Cond: ((task_no)::text = (t.task_no)::text) -\u0026gt; Index Scan using lzl_202503_task_no_idx on lzl_202503 cc_3 (cost=0.43..1362.61 rows=1637 width=57) Index Cond: ((task_no)::text = (t.task_no)::text) Filter: ((created_date \u0026gt; \u0026#39;2025-01-07 09:00:00\u0026#39;::timestamp without time zone) AND (created_date \u0026lt; \u0026#39;2025-09-03 12:56:44.973\u0026#39;::timestamp without time zone)) -\u0026gt; Index Scan using lzl_202504_task_no_idx on lzl_202504 cc_4 (cost=0.43..604.64 rows=1795 
width=56) Index Cond: ((task_no)::text = (t.task_no)::text) Filter: ((created_date \u0026gt; \u0026#39;2025-01-07 09:00:00\u0026#39;::timestamp without time zone) AND (created_date \u0026lt; \u0026#39;2025-09-03 12:56:44.973\u0026#39;::timestamp without time zone)) -\u0026gt; Index Scan using lzl_202505_task_no_idx on lzl_202505 cc_5 (cost=0.43..445.30 rows=1450 width=56) Index Cond: ((task_no)::text = (t.task_no)::text) Filter: ((created_date \u0026gt; \u0026#39;2025-01-07 09:00:00\u0026#39;::timestamp without time zone) AND (created_date \u0026lt; \u0026#39;2025-09-03 12:56:44.973\u0026#39;::timestamp without time zone)) -\u0026gt; Index Scan using lzl_202506_task_no_idx on lzl_202506 cc_6 (cost=0.43..583.94 rows=1675 width=56) Index Cond: ((task_no)::text = (t.task_no)::text) Filter: ((created_date \u0026gt; \u0026#39;2025-01-07 09:00:00\u0026#39;::timestamp without time zone) AND (created_date \u0026lt; \u0026#39;2025-09-03 12:56:44.973\u0026#39;::timestamp without time zone)) -\u0026gt; Index Scan using lzl_202507_task_no_idx on lzl_202507 cc_7 (cost=0.43..633.45 rows=1973 width=56) Index Cond: ((task_no)::text = (t.task_no)::text) Filter: ((created_date \u0026gt; \u0026#39;2025-01-07 09:00:00\u0026#39;::timestamp without time zone) AND (created_date \u0026lt; \u0026#39;2025-09-03 12:56:44.973\u0026#39;::timestamp without time zone)) -\u0026gt; Index Scan using lzl_202508_task_no_idx on lzl_202508 cc_8 (cost=0.43..619.43 rows=1720 width=56) Index Cond: ((task_no)::text = (t.task_no)::text) Filter: ((created_date \u0026gt; \u0026#39;2025-01-07 09:00:00\u0026#39;::timestamp without time zone) AND (created_date \u0026lt; \u0026#39;2025-09-03 12:56:44.973\u0026#39;::timestamp without time zone)) -\u0026gt; Index Scan using lzl_202509_task_no_idx on lzl_202509 cc_9 (cost=0.42..893.03 rows=1521 width=56) Index Cond: ((task_no)::text = (t.task_no)::text) Filter: ((created_date \u0026gt; \u0026#39;2025-01-07 09:00:00\u0026#39;::timestamp without time zone) AND 
(created_date \u0026lt; \u0026#39;2025-09-03 12:56:44.973\u0026#39;::timestamp without time zone)) The created_date time range searches for data within 1 year. The index added the night before was on created_date.\nExecution plan after adding the index:\n-\u0026gt; Hash Join (cost=63.37..23740.82 rows=191 width=33) Hash Cond: ((cc.task_no)::text = (t.task_no)::text) -\u0026gt; Append (cost=0.00..23376.98 rows=114435 width=58) Subplans Removed: 28 -\u0026gt; Index Scan using idx_lzltab_202501_created_date on lzltab_202501 cc_1 (cost=0.43..1450.59 rows=8958 width=66) Index Cond: ((created_date \u0026gt; $1) AND (created_date \u0026lt; $2)) -\u0026gt; Index Scan using idx_lzltab_202502_created_date on lzltab_202502 cc_2 (cost=0.43..1822.73 rows=7405 width=66) Index Cond: ((created_date \u0026gt; $1) AND (created_date \u0026lt; $2)) -\u0026gt; Index Scan using idx_lzltab_202503_created_date on lzltab_202503 cc_3 (cost=0.43..1430.03 rows=7917 width=57) Index Cond: ((created_date \u0026gt; $1) AND (created_date \u0026lt; $2)) -\u0026gt; Index Scan using idx_lzltab_202504_created_date on lzltab_202504 cc_4 (cost=0.43..2412.44 rows=11041 width=56) Index Cond: ((created_date \u0026gt; $1) AND (created_date \u0026lt; $2)) -\u0026gt; Index Scan using idx_lzltab_202505_created_date on lzltab_202505 cc_5 (cost=0.43..2260.73 rows=13381 width=56) Index Cond: ((created_date \u0026gt; $1) AND (created_date \u0026lt; $2)) -\u0026gt; Index Scan using idx_lzltab_202506_created_date on lzltab_202506 cc_6 (cost=0.43..3930.10 rows=17832 width=56) Index Cond: ((created_date \u0026gt; $1) AND (created_date \u0026lt; $2)) -\u0026gt; Index Scan using idx_lzltab_202507_created_date on lzltab_202507 cc_7 (cost=0.43..3878.77 rows=21786 width=56) Index Cond: ((created_date \u0026gt; $1) AND (created_date \u0026lt; $2)) -\u0026gt; Index Scan using idx_lzltab_202508_created_date on lzltab_202508 cc_8 (cost=0.43..4736.72 rows=22033 width=56) Index Cond: ((created_date \u0026gt; $1) AND 
(created_date \u0026lt; $2)) -\u0026gt; Index Scan using idx_lzltab_202509_created_date on lzltab_202509 cc_9 (cost=0.42..627.09 rows=1893 width=56) Index Cond: ((created_date \u0026gt; $1) AND (created_date \u0026lt; $2)) -\u0026gt; Hash (cost=63.03..63.03 rows=27 width=24) -\u0026gt; Bitmap Heap Scan on ai_outbound_call_task t (cost=2.99..63.03 rows=27 width=24) Recheck Cond: ((created_by)::text = ($3)::text) -\u0026gt; Bitmap Index Scan on idx_ai_call_task_c (cost=0.00..2.99 rows=27 width=0) Index Cond: ((created_by)::text = ($3)::text) The new execution plan switched from using the task_no index to using the created_date index, and changed from a Nested Loop to a Hash Join. The cost dropped from 2,259,694 to 23,740 — a 100x reduction. However, the actual execution time increased by roughly 10x.\nProblem Diagnosis # Let\u0026rsquo;s work through three questions to analyze and diagnose the issue:\nWhy did the optimizer suggest the created_date index? Why did it end up using the new index? Why is the estimated row count very small even though the actual execution time is very long? Why Did the Optimizer Suggest the created_date Index? # If we directly substitute the parameters from the PostgreSQL log into the SQL text, the execution plan is actually the good one — the one that runs in 3 seconds using the task_no index. The optimization engineer also ran it this way and found it to be fine. 
But in production, this wasn\u0026rsquo;t the execution plan that was used.\nEven when we force PostgreSQL not to use the task_no index, the optimizer chooses a sequential scan rather than the created_date index:\nHash Cond: (((cc.task_no)::text || \u0026#39;\u0026#39;::text) = (t.task_no)::text) -\u0026gt; Append (cost=0.00..2794425.58 rows=22238757 width=57) -\u0026gt; Seq Scan on lzltab_202501 cc_1 (cost=0.00..193060.05 rows=1585238 width=66) Filter: ((created_date \u0026gt; \u0026#39;2025-01-08 11:00:00\u0026#39;::timestamp without time zone) AND (created_date \u0026lt; \u0026#39;2025-09-04 08:31:43\u0026#39;::timestamp without time zone)) -\u0026gt; Seq Scan on lzltab_202502 cc_2 (cost=0.00..178567.54 rows=1480969 width=66) Filter: ((created_date \u0026gt; \u0026#39;2025-01-08 11:00:00\u0026#39;::timestamp without time zone) AND (created_date \u0026lt; \u0026#39;2025-09-04 08:31:43\u0026#39;::timestamp without time zone)) -\u0026gt; Seq Scan on lzltab_202503 cc_3 (cost=0.00..191073.34 rows=1583356 width=57) This is very strange: no matter how we ran it ourselves, we couldn\u0026rsquo;t get it to use the bad created_date index. So how did production end up using it?\nThe answer lies in bind variables — it was likely a generic plan.\nCharacteristics of the generic plan:\nWhen plan_cache_mode = auto, PostgreSQL compares the generic plan cost against the average cost of the first five hard parses (custom plans). If the generic plan has a lower cost, it is used and subsequent executions skip hard parsing; otherwise, every execution undergoes hard parsing (see the source function choose_custom_plan). What the generic plan looks like has nothing to do with the actual bind variable values. 
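The selection rule described above can be sketched in a few lines. This is a simplified model of the decision made in choose_custom_plan under plan_cache_mode = auto, not the PostgreSQL source itself: the PlanSource class and the simulate helper are hypothetical names introduced for illustration, the real function has additional early exits (one-shot plans, statements without parameters), and the real custom-plan cost also includes a planning-cost charge that is ignored here. The two cost figures are the top-level costs quoted in this article.

```python
from dataclasses import dataclass


@dataclass
class PlanSource:
    # Hypothetical stand-in for the cached-plan bookkeeping.
    generic_cost: float
    num_custom_plans: int = 0
    total_custom_cost: float = 0.0


def choose_custom_plan(src, mode='auto'):
    # Session settings can force the decision either way.
    if mode == 'force_generic_plan':
        return False
    if mode == 'force_custom_plan':
        return True
    # Build custom plans until at least five have been produced.
    if src.num_custom_plans < 5:
        return True
    # Afterwards, the generic plan wins if it looks cheaper on average.
    avg_custom_cost = src.total_custom_cost / src.num_custom_plans
    return not (src.generic_cost < avg_custom_cost)


def simulate(generic_cost, custom_cost, executions):
    # Replay repeated EXECUTE calls and record which plan type is chosen.
    src = PlanSource(generic_cost=generic_cost)
    chosen = []
    for _ in range(executions):
        if choose_custom_plan(src):
            src.num_custom_plans += 1
            src.total_custom_cost += custom_cost
            chosen.append('custom')
        else:
            chosen.append('generic')
    return chosen


# Costs taken from the two plans in this article (illustrative only).
print(simulate(23740.82, 2259694.20, 6))
```

Under these assumptions the first five executions are planned with their real parameter values, and the sixth switches to the generic plan because its estimated cost is far lower, even though that estimate is the misleading one. The switch on the sixth execution is the behavior in question.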
This is easy to reproduce using bind variables via PREPARE/EXECUTE:\nPREPARE sql1(timestamp without time zone,timestamp without time zone,text) AS SELECT COUNT(*) xxxxxxx...; =# EXECUTE sql1(\u0026#39;2025-01-08 11:00:00\u0026#39;,\u0026#39;2025-09-04 08:31:43\u0026#39;,\u0026#39;LIUZHILONG62\u0026#39;); count ------- 12016 (1 row) Time: 367.220 ms =# EXECUTE sql1(\u0026#39;2025-01-08 11:00:00\u0026#39;,\u0026#39;2025-09-04 08:31:43\u0026#39;,\u0026#39;LIUZHILONG62\u0026#39;); count ------- 12016 (1 row) Time: 254.386 ms =# EXECUTE sql1(\u0026#39;2025-01-08 11:00:00\u0026#39;,\u0026#39;2025-09-04 08:31:43\u0026#39;,\u0026#39;LIUZHILONG62\u0026#39;); count ------- 12016 (1 row) Time: 235.343 ms =# EXECUTE sql1(\u0026#39;2025-01-08 11:00:00\u0026#39;,\u0026#39;2025-09-04 08:31:43\u0026#39;,\u0026#39;LIUZHILONG62\u0026#39;); count ------- 12016 (1 row) Time: 234.110 ms =# EXECUTE sql1(\u0026#39;2025-01-08 11:00:00\u0026#39;,\u0026#39;2025-09-04 08:31:43\u0026#39;,\u0026#39;LIUZHILONG62\u0026#39;); count ------- 12016 (1 row) Time: 233.570 ms =# EXECUTE sql1(\u0026#39;2025-01-08 11:00:00\u0026#39;,\u0026#39;2025-09-04 08:31:43\u0026#39;,\u0026#39;LIUZHILONG62\u0026#39;); count ------- 12016 (1 row) Time: 70678.344 ms (01:10.678) -- 6th execution is significantly slower =# select * from pg_prepared_statements\\gx -- the generic_plans/custom_plans columns were added in pg14 generic_plans | 1 custom_plans | 5 The first 5 hard parses (custom plans) all executed quickly. The 6th execution used the generic plan, which used the created_date index. This was exactly the plan that failed in production, and it was extremely slow.\nSo while the optimization suggestion to use the created_date index was somewhat problematic, substituting actual values for the bind variables and running EXPLAIN did produce a correct execution plan. 
In production, however, the application used bind variables, and the generic plan kicked in — causing the failure.\nWhy Is the Estimated Row Count Small But the Actual Execution Time Very Long? # The failing execution plan has a problem: the estimated cost is too small, and the estimated rows are too few.\n-\u0026gt; Index Scan using idx_lzltab_202501_created_date on lzltab_202501 cc_1 (cost=0.43..1450.59 rows=8958 width=66) Index Cond: ((created_date \u0026gt; $1) AND (created_date \u0026lt; $2)) From a business logic perspective, this looks abnormal. The created_date condition spans multiple partitions, and since created_date is the partition key, WHERE created_date \u0026gt;= xx AND \u0026lt;= yy must be contiguous. The selectivity on a sub-partition should always be 1, meaning rows should equal the sub-partition row count — several million, not several thousand.\nAt first I thought it was a statistics issue, but the statistics were fairly accurate — the historical partition data for 202501 hadn\u0026rsquo;t changed.\nSince this is a generic plan issue, we need to examine the generic plan cost estimation by reading the source code. Cost estimation is more complex, but rows estimation is relatively easier to understand and locate.\nstatic double calc_rangesel(TypeCacheEntry *typcache, VariableStatData *vardata, const RangeType *constval, Oid operator) { ... else { /* with any other operator, empty Op non-empty matches nothing */ selec = (1.0 - empty_frac) * hist_selec; } } /* all range operators are strict */ selec *= (1.0 - null_frac); range_select = (1 - null_frac) * histogram_selectivity. The range histogram selectivity looks at the histogram buckets hit by the range plus any matching MCV entries. However, we don\u0026rsquo;t need to compute all this for this case.\nBecause the generic plan does not look at the histogram:\n/* * rangesel -- restriction selectivity for range operators */ Datum rangesel(PG_FUNCTION_ARGS) { ... 
/* * If we got a valid constant on one side of the operator, proceed to * estimate using statistics. Otherwise punt and return a default constant * estimate. Note that calc_rangesel need not handle * OID_RANGE_ELEM_CONTAINED_OP. */ if (constrange) selec = calc_rangesel(typcache, \u0026amp;vardata, constrange, operator); else selec = default_range_selectivity(operator); ... } calc_rangesel is the selectivity calculation function that takes constant values (used above). The else branch calls default_range_selectivity, which receives no constants at all.\n/* * Returns a default selectivity estimate for given operator, when we don\u0026#39;t * have statistics or cannot use them for some reason. */ static double default_range_selectivity(Oid operator) { switch (operator) { ... case OID_RANGE_CONTAINS_ELEM_OP: case OID_RANGE_ELEM_CONTAINED_OP: /* * \u0026#34;range @\u0026gt; elem\u0026#34; is more or less identical to a scalar * inequality \u0026#34;A \u0026gt;= b AND A \u0026lt;= c\u0026#34;. */ return DEFAULT_RANGE_INEQ_SEL; } } The default range selectivity is defined as:\n/* default selectivity estimate for range inequalities \u0026#34;A \u0026gt; b AND A \u0026lt; c\u0026#34; */ #define DEFAULT_RANGE_INEQ_SEL\t0.005 Let\u0026rsquo;s verify this against the production row estimate:\nselect reltuples::bigint*0.005 from pg_class where relname=\u0026#39;lzltab_202501\u0026#39;\\gx -[ RECORD 1 ]------ ?column? 
| 8958.350 This matches the actual estimated rows of 8958:\nidx_lzltab_202501_created_date on lzltab_202501 cc_1 (cost=0.43..1450.59 rows=8958 width=66) So the new execution plan\u0026rsquo;s inaccurate estimate is because the generic plan uses a default selectivity of 0.005.\nSummary # Why Does the Generic Plan Exist, and the Problem with Soft Parsing # It\u0026rsquo;s easier to think of the generic plan as a \u0026ldquo;DEFAULT estimate plan.\u0026rdquo;\nWhy does the generic plan always seem to have problems?\nLet\u0026rsquo;s trace the reasoning chain:\nThe generic plan exists to reduce hard parsing, i.e., to enable soft parsing. If we don\u0026rsquo;t hard-parse every execution, we can reuse an execution plan without passing specific parameter values. If we don\u0026rsquo;t pass parameters and directly use an execution plan, that plan must be generated in advance. Ways to generate an execution plan in advance:\nA parameter-less execution plan (the generic plan) Reuse an execution plan generated from the first few executions with parameters (PostgreSQL doesn\u0026rsquo;t have this) If we use a generic plan, it can be inaccurate, for example:\nData skew (e.g., a particular MCV has a very high frequency, like WHERE a = 1 but a = 1 appears extremely often). This heavily depends on what the parameter value actually is, but the generic plan receives no parameters, so the plan cannot be accurate. Evenly distributed data where selectivity still cannot be accurately calculated (e.g., WHERE a \u0026gt; $1 AND a \u0026lt; $2). Without knowing the range, no one can compute the selectivity. The generic plan receives no parameters, so the plan cannot be accurate. If we reused plans from the first few parameterized executions (which PostgreSQL doesn\u0026rsquo;t do), they could also be inaccurate:\nData skew: the first few parameter values may not be representative, and they would heavily influence what the subsequent fixed plan looks like. 
Categories of Generic Plan Estimation Problems # Because the comparison requires 5 custom plans first, generic plan problems can be divided into two categories:\nThe first 5 SQL executions are not representative. This is closely tied to the first 5 execution plans and depends on data skew and whether the first 5 parameter values are representative. The generic plan itself is problematic. Due to data skew or the inability to accurately compute selectivity for evenly distributed data, the generic plan itself is inefficient. Optimization Recommendations # Based on this case, generic plan issues can appear on partitioned tables. The partition key is contiguous, and selectivity when scanning all partitions should be 1, but the generic plan uses 0.005, which can easily lead to a \u0026ldquo;full index scan\u0026rdquo; scenario.\nSo during optimization, we need to consider more:\nAvoid creating too many indexes that confuse the optimizer. Eliminate generic plan interference. Use EXECUTE to truly run the query 6 times. At the session level, set plan_cache_mode = 'force_generic_plan' or set plan_cache_mode = 'force_custom_plan' to compare execution plans. Or, on pg16+, use EXPLAIN (GENERIC_PLAN) to compare. 
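To put a number on the gap between 0.005 and 1, here is a small worked calculation. It is a sketch under stated assumptions: the reltuples value is not shown anywhere in the article and is back-derived from the 8958.350 result of the verification query (8958.350 / 0.005), and estimated_rows is a hypothetical helper, not a PostgreSQL function.

```python
# Default selectivity the planner falls back to for a range inequality
# such as 'a > $1 AND a < $2' when no constants are available.
DEFAULT_RANGE_INEQ_SEL = 0.005


def estimated_rows(reltuples, selectivity):
    # Row estimate = table (partition) row count times selectivity.
    return reltuples * selectivity


# Assumption: back-derived from the article's verification query,
# 8958.350 / 0.005. The article does not print this value directly.
reltuples = 1_791_670

generic_estimate = estimated_rows(reltuples, DEFAULT_RANGE_INEQ_SEL)
full_partition = estimated_rows(reltuples, 1.0)

print(round(generic_estimate))                   # rows the generic plan expects
print(round(full_partition))                     # rows if the whole partition matches
print(round(full_partition / generic_estimate))  # underestimation factor
```

When the range covers an entire partition, the generic plan therefore expects roughly 200 times fewer rows than the index scan will actually return, which is why an index scan on created_date looks cheap on paper.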
Syntax reference:\n--prepare/execute PREPARE sql1(text) AS SELECT COUNT(*) FROM LZL where a=$1; EXECUTE sql1(\u0026#39;zzz\u0026#39;); -- run 6 times first EXPLAIN EXECUTE sql1(\u0026#39;zzz\u0026#39;); select * from pg_prepared_statements; -- view prepared statement info, current session only -- Compare execution plans by setting session parameters before EXPLAIN EXECUTE set plan_cache_mode=\u0026#39;force_generic_plan\u0026#39;; set plan_cache_mode=\u0026#39;force_custom_plan\u0026#39;; -- Directly view generic plan, pg16+ explain (GENERIC_PLAN) xx ","date":"Sep 13, 2025","externalUrl":null,"permalink":"/en/2025/09/13/case-study-performance-degradation-after-adding-an-index-and-the-generic-plan/","section":"Posts","summary":"Problem Description # An index was added the night before, and the next morning the CPU was maxed out. The problematic SQL was easy to locate — just one query. The SQL was running for over 30 seconds, but the day before it only took about 3 seconds, so we needed to examine the before-and-after execution plan changes.\nOnly the key parts of the execution plan are shown below.\n","title":"Case Study: Performance Degradation After Adding an Index and the Generic Plan","type":"posts"},{"content":" Problem Symptoms # The Symptom # A static historical table with no updates whatsoever — yet queries on the same-city standby consistently hit query conflicts:\nERROR: 40001: canceling statement due to conflict with recovery DETAIL: User query might have needed to see row versions that must be removed. 
LOCATION: ProcessInterrupts, postgres.c:3197 Time: 30534.973 ms (00:30.535) Why a Query Conflict on a Static Table Matters # My understanding was that a static table should never experience conflicts (this understanding was wrong — I\u0026rsquo;ll explain later).\nThe official documentation lists the conflict cases:\nAccess Exclusive locks taken on the primary server, including both explicit LOCK commands and various DDL actions, conflict with table accesses in standby queries. Dropping a tablespace on the primary conflicts with standby queries using that tablespace for temporary work files. Dropping a database on the primary conflicts with sessions connected to that database on the standby. Application of a vacuum cleanup record from WAL conflicts with standby transactions whose snapshots can still \u0026ldquo;see\u0026rdquo; any of the rows to be removed. Application of a vacuum cleanup record from WAL conflicts with queries accessing the target page on the standby, whether or not the data to be removed is visible. LOCK, DDL, drop tablespace, drop database — definitely none of those.\nVacuum — none either, confirmed by pg_stat_all_tables.last_autovacuum and WAL vacuum records.\nThe official documentation\u0026rsquo;s explanation stops there. I carefully verified that none of the above applied.\nExtrapolating from existing knowledge, perhaps other scenarios could kill the xmin held by a standby query\u0026rsquo;s snapshot. For example, in-page pruning removes xmin from rows on a page — if the standby query\u0026rsquo;s snapshot still depends on those xmins, theoretically a conflict could occur. But a page belongs to a specific table, and querying only one table holds only snapshots and xmins on that table. 
So, theoretically, in-page pruning on table A should not cause a query conflict on table B (this understanding was also wrong — I\u0026rsquo;ll explain later).\nPG\u0026rsquo;s official documentation on query conflict scenarios is fairly vague and doesn\u0026rsquo;t explain well why a static table can experience conflicts. Even combining it with my own extrapolations, there shouldn\u0026rsquo;t be a conflict. But I noticed this pattern seemed to exist on many instances, so it was worth investigating.\nRoot Cause Analysis # Since the startup process kills the query, checking the startup process\u0026rsquo;s pstack should reveal the conflict function:\n$ pstack 212012 #0 0x00002b283f63d783 in __select_nocancel () from /lib64/libc.so.6 #1 0x00000000008fcf5a in pg_usleep (microsec=\u0026lt;optimized out\u0026gt;) at pgsleep.c:56 #2 0x0000000000787905 in WaitExceedsMaxStandbyDelay (wait_event_info=134217762) at standby.c:208 #3 ResolveRecoveryConflictWithVirtualXIDs (waitlist=0x2398a50, reason=reason@entry=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT, wait_event_info=wait_event_info@entry=134217762, report_waiting=report_waiting@entry=true) at standby.c:276 #4 0x0000000000787b33 in ResolveRecoveryConflictWithVirtualXIDs (report_waiting=true, wait_event_info=134217762, reason=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT, waitlist=\u0026lt;optimized out\u0026gt;) at standby.c:333 #5 ResolveRecoveryConflictWithSnapshot (latestRemovedXid=\u0026lt;optimized out\u0026gt;, node=...) 
at standby.c:329 #6 0x00000000004c8ffe in heap_xlog_clean (record=0x2366978) at heapam.c:7764 #7 heap2_redo (record=0x2366978) at heapam.c:8917 #8 0x0000000000519e55 in StartupXLOG () at xlog.c:7411 #9 0x000000000072f211 in StartupProcessMain () at startup.c:204 #10 0x00000000005286b1 in AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7ffeb7e39d70) at bootstrap.c:450 #11 0x000000000072c369 in StartChildProcess (type=StartupProcess) at postmaster.c:5494 #12 0x000000000072eb54 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x232edb0) at postmaster.c:1407 #13 0x00000000004892cf in main (argc=3, argv=0x232edb0) at main.c:210 XLOG_HEAP2_CLEAN # void heap2_redo(XLogReaderState *record) { uint8\tinfo = XLogRecGetInfo(record) \u0026amp; ~XLR_INFO_MASK; switch (info \u0026amp; XLOG_HEAP_OPMASK) { case XLOG_HEAP2_CLEAN: heap_xlog_clean(record); break; Only when the redo is XLOG_HEAP2_CLEAN does it enter the next function heap_xlog_clean.\nPG 18 no longer has XLOG_HEAP2_CLEAN (it was actually removed around PG15 — this article only looks at versions 13 and 18), but the define can still be found in heapam_xlog.h:\n//pg13 #define XLOG_HEAP2_CLEAN\t0x10 #define XLOG_HEAP2_FREEZE_PAGE\t0x20 #define XLOG_HEAP2_CLEANUP_INFO 0x30 //pg18 * There\u0026#39;s no difference between XLOG_HEAP2_PRUNE_ON_ACCESS, * XLOG_HEAP2_PRUNE_VACUUM_SCAN and XLOG_HEAP2_PRUNE_VACUUM_CLEANUP records. * They have separate opcodes just for debugging and analysis purposes, to * indicate why the WAL record was emitted. */ #define XLOG_HEAP2_PRUNE_ON_ACCESS\t0x10 #define XLOG_HEAP2_PRUNE_VACUUM_SCAN\t0x20 #define XLOG_HEAP2_PRUNE_VACUUM_CLEANUP\t0x30 I pulled out PG18\u0026rsquo;s source because PG13 (our production version) has zero explanation for these CLEAN xl_info macros, making them hard to understand. 
Since PG18 renamed the macros to something more intuitive and added comments, we can use PG18\u0026rsquo;s source to understand PG13\u0026rsquo;s — to figure out what this WAL record does.\nAll three opcodes are fundamentally PRUNE-related WAL records. From the names, PRUNE_ON_ACCESS looks like pruning triggered by access, while the other two are tied to VACUUM operations.\nWhen checking with pg_waldump, rmgr: Heap2 CLEAN remxid records appear every few seconds, with highly varied filenodes and no relation to the static table:\n$ pg_waldump 00000001000012FE00000001 |tail -200|egrep -i heap2 pg_waldump: fatal: error in WAL record at 12FE/F34F138: invalid resource manager ID 50 at 12FE/F34F168 rmgr: Heap2 len (rec/tot): 61/ 3520, tx: 0, lsn: 12FE/0F346ED0, prev 12FE/0F346EA0, desc: CLEAN remxid 1983744188, blkref #0: rel 1663/88121/1083807 blk 617606 FPW rmgr: Heap2 len (rec/tot): 66/ 66, tx: 0, lsn: 12FE/0F34BC60, prev 12FE/0F34BC30, desc: CLEAN remxid 1984090598, blkref #0: rel 1663/88121/504681 blk 1447147 This matches our symptom pattern: no vacuum activity, but PRUNE is happening, leading into heap_xlog_clean → ResolveRecoveryConflictWithSnapshot and the rest of the conflict machinery.\nThe PRUNE action producing rmgr: Heap2 CLEAN remxid WAL records will be demonstrated later via testing.\nLet\u0026rsquo;s finish the source code analysis first.\nResolveRecoveryConflictWithSnapshot # void ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode node) { VirtualTransactionId *backends; /* * If we get passed InvalidTransactionId then we do nothing (no conflict). * * This can happen when replaying already-applied WAL records after a * standby crash or restart, or when replaying an XLOG_HEAP2_VISIBLE * record that marks as frozen a page which was already all-visible. 
It\u0026#39;s * also quite common with records generated during index deletion * (original execution of the deletion can reason that a recovery conflict * which is sufficient for the deletion operation must take place before * replay of the deletion record itself). */ if (!TransactionIdIsValid(latestRemovedXid)) return; backends = GetConflictingVirtualXIDs(latestRemovedXid, node.dbNode); ResolveRecoveryConflictWithVirtualXIDs(backends, PROCSIG_RECOVERY_CONFLICT_SNAPSHOT, WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT, true); } There are several types of query conflicts. ResolveRecoveryConflictWithSnapshot lives up to its name — it\u0026rsquo;s a snapshot conflict.\nGetConflictingVirtualXIDs finds which backends conflict with the snapshot. ResolveRecoveryConflictWithVirtualXIDs handles the actual conflict resolution and timeout.\nGetConflictingVirtualXIDs # GetConflictingVirtualXIDs is the key function that determines whether a backend\u0026rsquo;s virtual transaction ID triggers a query conflict. It requires a bit of brainpower.\nPrerequisite knowledge for understanding this function:\nlimitXmin is latestRemovedXid — the CLEAN remxid from WAL, the xid that needs to be cleaned up (I read remxid as \u0026ldquo;remove xid\u0026rdquo;). /*limitXmin is supplied as either latestRemovedXid, or InvalidTransactionId*/ PGPROC contains current process info: backend id, database id, lock info, and much more PGXACT contains the transaction info for the snapshot held by the current process. It\u0026rsquo;s lighter — the key field is xmin, the lowest xid the current process considers still running C\u0026rsquo;s || rule: if either operand is true (non-zero), the result is true (1) TransactionIdIsValid means xid != 0 — 0 is meaningless Key function GetConflictingVirtualXIDs explained:\nVirtualTransactionId * GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid) { ... 
for (index = 0; index \u0026lt; arrayP-\u0026gt;numProcs; index++) // iterate all local processes { int\tpgprocno = arrayP-\u0026gt;pgprocnos[index]; PGPROC\t*proc = \u0026amp;allProcs[pgprocno]; // process\u0026#39;s PGPROC PGXACT\t*pgxact = \u0026amp;allPgXact[pgprocno]; // process\u0026#39;s PGXACT /* Exclude prepared transactions */ if (proc-\u0026gt;pid == 0) // prepared transactions have no owning process — can\u0026#39;t handle continue; if (!OidIsValid(dbOid) || // global tables have dbOid=0 which is invalid — satisfies condition proc-\u0026gt;databaseId == dbOid) // only process current database. Cross-db is different — no transaction conflict at all. { /* Fetch xmin just once - can\u0026#39;t change on us, but good coding */ TransactionId pxmin = UINT32_ACCESS_ONCE(pgxact-\u0026gt;xmin); // pgxact-\u0026gt;xmin is the minimum xid of transactions held by this process. UINT32_ACCESS_ONCE is just for atomic access protection — the xmin logic is unchanged /* * We ignore an invalid pxmin because this means that backend has * no snapshot currently. We hold a Share lock to avoid contention * with users taking snapshots. That is not a problem because the * current xmin is always at least one higher than the latest * removed xid, so any new snapshot would never conflict with the * test here. */ if (!TransactionIdIsValid(limitXmin) || // limitXmin=0 possible? At least latestRemovedXid can\u0026#39;t be — I can\u0026#39;t think of a scenario where WAL would log an invalid xid (TransactionIdIsValid(pxmin) \u0026amp;\u0026amp; !TransactionIdFollows(pxmin, limitXmin))) // TransactionIdIsValid(pxmin) is also not really needed. 
!TransactionIdFollows(pxmin, limitXmin) means pxmin \u0026lt;= limitXmin { VirtualTransactionId vxid; GET_VXID_FROM_PGPROC(vxid, *proc); if (VirtualTransactionIdIsValid(vxid)) vxids[count++] = vxid; } } } The critical line is !TransactionIdFollows(pxmin, limitXmin).\nSo the core logic for determining query conflicts is:\nThe primary\u0026rsquo;s cleaned remxid \u0026gt;= the standby query\u0026rsquo;s snapshot-held minimum xid → conflict. Only kills queries in the current database; global system tables (no database) are killed indiscriminately. This means: even if the pruned table on the primary has nothing to do with the table being queried on the standby, a conflict CAN occur!!!\nIn-Page Pruning # Now that the conflict logic is clear, we still need to understand where the WAL CLEAN records come from. That requires looking at how PRUNE is triggered.\nFrom README.HOT on when pruning and defragmentation occur — \u0026ldquo;When can/should we prune or defragment?\u0026rdquo;:\nThe currently planned heuristic is to prune and defrag when first accessing a page that potentially has prunable tuples\nPrune and defragment are indeed two distinct concepts, but they often happen together.\nPrune: updating line pointers to shorten HOT chains, but doesn\u0026rsquo;t free space Defragment: reclaiming space from dead line pointers and tuples after pruning We cannot prune or defragment unless we can get a \u0026ldquo;buffer cleanup lock\u0026rdquo; on the target page; otherwise, pruning might destroy line pointers that other backends have live references to, and defragmenting might move tuples that other backends have live pointers to\nThe page must be under a \u0026ldquo;buffer cleanup lock\u0026rdquo; for prune or defragment to occur.\nThe worst-case consequence of this is only that an UPDATE cannot be made HOT but has to link to a new tuple version placed on some other page, for lack of centralized space on the original page.\nA typical scenario: a HOT update spills to another 
page (easy to test).\nspace reclamation happens during tuple retrieval when the page is nearly full (\u0026lt;10% free) and a buffer cleanup lock can be acquired. This means that UPDATE, DELETE, and SELECT can trigger space reclamation, but often not during INSERT \u0026hellip; VALUES because it does not retrieve a row.\nSELECT/UPDATE/DELETE that scan rows can trigger space reclamation. INSERT typically won\u0026rsquo;t, since it doesn\u0026rsquo;t retrieve rows.\nClearly, after prune or defragment, the corresponding xids should be reclaimed. From the README we can see that HOT updates can trigger prune/defragment and generate CLEAN WAL records; see the \u0026ldquo;Test: UPDATE Produces In-Page Pruning\u0026rdquo; section below.\nTesting # The tests below only observe whether conflicts occur, whether CLEAN WAL records appear, or whether page line pointers are updated, without distinguishing prune from defragment. In many cases both are triggered together; telling them apart is tedious and best left for later. 
The focus here is whether CLEAN WAL records appear.\nHelper SQL:\n--sql for test --heap_page_items select t_ctid,lp, case lp_flags when 0 then \u0026#39;0:LP_UNUSED\u0026#39; when 1 then \u0026#39;LP_NORMAL\u0026#39; when 2 then \u0026#39;LP_REDIRECT\u0026#39; when 3 then \u0026#39;LP_DEAD\u0026#39; end as lp_flags, t_xmin,t_xmax,t_field3 as t_cid, raw_flags, info.combined_flags,substring(t_data,0,40) from heap_page_items(get_raw_page(\u0026#39;lzl\u0026#39;,0)) item, LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2) info order by lp; --heap header select * from page_header(get_raw_page(\u0026#39;lzl\u0026#39;,0)); --bt_page_items SELECT itemoffset, ctid, itemlen, nulls, vars, dead, htid, tids[0:2] AS some_tids FROM bt_page_items(\u0026#39;idxlzl\u0026#39;,1); --create table create table lzl(a char(2000)); create index idxlzl on lzl(a); insert into lzl values(\u0026#39;z\u0026#39;); update lzl set a=md5(random()::text); -- non-hot update lzl set a=\u0026#39;z\u0026#39;; -- hot --force index scan set enable_seqscan =off; set enable_indexonlyscan=off; --open an RR transaction to hold a snapshot for observation BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ; Test: Cross-Table Query Conflict # primary standby create table lzl(a bigint primary key); insert into lzl values(1); BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ; select 1; update lzl set a=2; no blocking vacuum lzl; #3 ResolveRecoveryConflictWithVirtualXIDs (waitlist=0x277c340, reason=reason@entry=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT, wait_event_info=wait_event_info@entry=134217762, report_waiting=report_waiting@entry=true) at standby.c:276#4 0x0000000000787b33 in ResolveRecoveryConflictWithVirtualXIDs (report_waiting=true, wait_event_info=134217762, reason=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT, waitlist=) at standby.c:333#5 ResolveRecoveryConflictWithSnapshot (latestRemovedXid=, node=\u0026hellip;) at standby.c:329#6 0x00000000004c8ffe in heap_xlog_clean (record=0x273a258) at heapam.c:7764 
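The rule this test exercises can be sketched as a small predicate. This is a simplified Python model of the check in GetConflictingVirtualXIDs as quoted earlier, not the PostgreSQL source: conflicts and xid_follows are hypothetical names, the remxid is borrowed from the UPDATE test further down, and the database oids and snapshot xmin values are made up for illustration. Note that no table identifier appears among the inputs.

```python
INVALID_XID = 0
FIRST_NORMAL_XID = 3


def xid_follows(id1, id2):
    # Modular 32-bit comparison in the spirit of TransactionIdFollows.
    # Special (non-normal) xids are compared as plain unsigned integers.
    if id1 < FIRST_NORMAL_XID or id2 < FIRST_NORMAL_XID:
        return id1 > id2
    diff = (id1 - id2) & 0xFFFFFFFF
    if diff >= 0x80000000:
        diff -= 0x100000000  # reinterpret as a signed 32-bit value
    return diff > 0


def conflicts(backend_db, backend_xmin, wal_db, remxid):
    # Only backends in the same database are considered, unless the
    # WAL record carries an invalid database oid (0).
    if wal_db != 0 and backend_db != wal_db:
        return False
    if remxid == INVALID_XID:
        return True
    # Conflict when the snapshot xmin does not follow the removed xid,
    # that is, xmin <= remxid in modular terms.
    return backend_xmin != INVALID_XID and not xid_follows(backend_xmin, remxid)


# Hypothetical values: database oid 5893914, remxid 34954177.
print(conflicts(5893914, 34954170, 5893914, 34954177))  # older snapshot
print(conflicts(5893914, 34954178, 5893914, 34954177))  # newer snapshot
print(conflicts(12345, 34954170, 5893914, 34954177))    # other database
```

The inputs are a database oid and two transaction ids, which is why a standby query on one table can be cancelled by pruning on a completely different table in the same database.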
Conclusion: As long as a query exists, it has a snapshot, and a snapshot has a snapshot xmin. Even if the queried table is completely unrelated, a query conflict CAN occur.\nTest: Vacuum Produces In-Page Pruning # Pruning occurs, conflicts occur. Example omitted, as it is not relevant to this case.\nTest: UPDATE Produces In-Page Pruning # --HOT, off-page update triggers defragment --An 8k heap page stores 4 to 2xx rows. Here we size rows so 4 fit and remain HOT; the next update spills off-page create table lzl(a char(2000)); create index idxlzl on lzl(a); insert into lzl values(\u0026#39;z\u0026#39;); update lzl set a=\u0026#39;z\u0026#39;; --hot update lzl set a=\u0026#39;z\u0026#39;; --hot update lzl set a=\u0026#39;z\u0026#39;; --hot --heap page: 4 rows, all HOT: t_ctid | lp | lp_flags | t_xmin | t_xmax | t_cid | raw_flags | combined_flags | --------+----+-----------+----------+----------+-------+----------------------------------------------------------------------------------------------------------+----------------+---------- (0,2) | 1 | LP_NORMAL | 34954161 | 34954162 | 0 | {HEAP_HASVARWIDTH,HEAP_XMIN_COMMITTED,HEAP_XMAX_COMMITTED,HEAP_HOT_UPDATED} | {} | \\x501f000 (0,3) | 2 | LP_NORMAL | 34954162 | 34954163 | 0 | {HEAP_HASVARWIDTH,HEAP_XMIN_COMMITTED,HEAP_XMAX_COMMITTED,HEAP_UPDATED,HEAP_HOT_UPDATED,HEAP_ONLY_TUPLE} | {} | \\x501f000 (0,4) | 3 | LP_NORMAL | 34954163 | 34954164 | 0 | {HEAP_HASVARWIDTH,HEAP_XMIN_COMMITTED,HEAP_UPDATED,HEAP_HOT_UPDATED,HEAP_ONLY_TUPLE} | {} | \\x501f000 (0,4) | 4 | LP_NORMAL | 34954164 | 0 | 0 | {HEAP_HASVARWIDTH,HEAP_XMAX_INVALID,HEAP_UPDATED,HEAP_ONLY_TUPLE} | {} | \\x501f000 (4 rows) --index: only one entry: itemoffset | ctid | itemlen | nulls | vars | dead | htid | some_tids ------------+-------+---------+-------+------+------+-------+----------- 1 | (0,1) | 48 | f | t | f | (0,1) | [null] --One more update triggers off-page update update lzl set a=\u0026#39;z\u0026#39;; --page full, can\u0026#39;t HOT --HOT chain changed. 
LP changed t_ctid | lp | lp_flags | t_xmin | t_xmax | t_cid | raw_flags | combined_flags | --------+----+-------------+----------+----------+--------+--------------------------------------------------------------------------------------+----------------+--------------------------- [null] | 1 | LP_REDIRECT | [null] | [null] | [null] | [null] | [null] | [null] (0,2) | 2 | LP_NORMAL | 34954165 | 0 | 0 | {HEAP_HASVARWIDTH,HEAP_XMAX_INVALID,HEAP_UPDATED,HEAP_ONLY_TUPLE} | {} | \\x501f00007a20202020202020 [null] | 3 | 0:LP_UNUSED | [null] | [null] | [null] | [null] | [null] | [null] (0,2) | 4 | LP_NORMAL | 34954164 | 34954165 | 0 | {HEAP_HASVARWIDTH,HEAP_XMIN_COMMITTED,HEAP_UPDATED,HEAP_HOT_UPDATED,HEAP_ONLY_TUPLE} | {} | \\x501f00007a20202020202020 (4 rows) --index: still only one entry, unchanged: itemoffset | ctid | itemlen | nulls | vars | dead | htid | some_tids ------------+-------+---------+-------+------+------+-------+----------- 1 | (0,1) | 48 | f | t | f | (0,1) | [null] The next update doesn\u0026rsquo;t go to a new page — instead, in-page pruning happens first, freeing space on the same page, so the row is written locally. This saves a page access.\nWAL produces CLEAN remxid, confirming that a query conflict can occur:\nrmgr: Heap2 len (rec/tot): 62/ 62, tx: 0, lsn: 3DB/F8017348, prev 3DB/F8017310, desc: CLEAN remxid 34954177, blkref #0: rel 1663/5893914/5893920 blk 0 rmgr: Heap len (rec/tot): 2070/ 2070, tx: 34954178, lsn: 3DB/F8017388, prev 3DB/F8017348, desc: HOT_UPDATE off 4 xmax 34954178 flags 0x10 ; new off 2 xmax 0, blkref #0: rel 1663/5893914/5893920 blk 0 Conclusion: UPDATE statements can produce in-page pruning and can cause query conflicts.\nTest: Hint-Bit Writeback Producing In-Page Pruning? 
# primary standby wal_log_hints=on truncate table lzl; insert into lzl values(\u0026lsquo;z\u0026rsquo;); BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ; select * from lzl; delete from lzl where a=\u0026lsquo;z\u0026rsquo;; checkpoint; select * from lzl; \u0026ndash;WAL contains FPI_FOR_HINT \u0026ndash;no query conflict Standby pageinspect:\nt_ctid | lp | lp_flags | t_xmin | t_xmax | t_cid | raw_flags | combined_flags | substring --------+----+-----------+----------+----------+-------+------------------------------------------------------------------------------+----------------+------------------------------------------------- (0,1) | 1 | LP_NORMAL | 34954229 | 34954230 | 0 | {HEAP_HASVARWIDTH,HEAP_XMIN_COMMITTED,HEAP_XMAX_COMMITTED,HEAP_KEYS_UPDATED} | {} | \\x501f00007a202020202020202020202020202020202020 (1 row) Conclusion: WAL log hints only sync hint bits and don\u0026rsquo;t affect xmin/xmax. No CLEAN or similar records are produced, so hint-bit writeback does NOT cause query conflicts.\nTest: SELECT Produces In-Page Pruning # SELECT normally doesn\u0026rsquo;t cause pruning, but it does when the page is nearly full: https://www.modb.pro/db/1683648157451362304\nTesting pruning on a full page:\n-- Same table as before, 4 HOT rows, nearly full insert into lzl values(\u0026#39;z\u0026#39;); update lzl set a=\u0026#39;z\u0026#39;; update lzl set a=\u0026#39;z\u0026#39;; update lzl set a=\u0026#39;z\u0026#39;; --page at this point: t_ctid | lp | lp_flags | t_xmin | t_xmax | t_cid | raw_flags | combined_flags | --------+----+-----------+----------+----------+-------+----------------------------------------------------------------------------------------------------------+----------------+--------------------- (0,2) | 1 | LP_NORMAL | 34954232 | 34954233 | 0 | {HEAP_HASVARWIDTH,HEAP_XMIN_COMMITTED,HEAP_XMAX_COMMITTED,HEAP_HOT_UPDATED} | {} | \\x501f00007a20202020 (0,3) | 2 | LP_NORMAL | 34954233 | 34954234 | 0 | 
{HEAP_HASVARWIDTH,HEAP_XMIN_COMMITTED,HEAP_XMAX_COMMITTED,HEAP_UPDATED,HEAP_HOT_UPDATED,HEAP_ONLY_TUPLE} | {} | \\x501f00007a20202020 (0,4) | 3 | LP_NORMAL | 34954234 | 34954235 | 0 | {HEAP_HASVARWIDTH,HEAP_XMIN_COMMITTED,HEAP_UPDATED,HEAP_HOT_UPDATED,HEAP_ONLY_TUPLE} | {} | \\x501f00007a20202020 (0,4) | 4 | LP_NORMAL | 34954235 | 0 | 0 | {HEAP_HASVARWIDTH,HEAP_XMAX_INVALID,HEAP_UPDATED,HEAP_ONLY_TUPLE} | {} | \\x501f00007a20202020 (4 rows) -- A SELECT select * from lzl; --page now shows in-page pruning: t_ctid | lp | lp_flags | t_xmin | t_xmax | t_cid | raw_flags | combined_flags | sub --------+----+-------------+----------+--------+--------+---------------------------------------------------------------------------------------+----------------+--------------------------------------- [null] | 1 | LP_REDIRECT | [null] | [null] | [null] | [null] | [null] | [null] [null] | 2 | 0:LP_UNUSED | [null] | [null] | [null] | [null] | [null] | [null] [null] | 3 | 0:LP_UNUSED | [null] | [null] | [null] | [null] | [null] | [null] (0,4) | 4 | LP_NORMAL | 34954235 | 0 | 0 | {HEAP_HASVARWIDTH,HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID,HEAP_UPDATED,HEAP_ONLY_TUPLE} | {} | \\x501f00007a20202020202020202020202020 Conclusion: SELECT can produce in-page pruning and can cause query conflicts.\nTest: Shared Table Cross-Database Query Conflict # Shared tables are global. Earlier in GetConflictingVirtualXIDs we saw that global tables are killed indiscriminately. Let\u0026rsquo;s test.\nShared table info:\nSource definition: IsSharedRelation Source check: shared ? 
InvalidOid : MyDatabaseId; Table: pg_class.relisshared Directory: global/ Querying pg_class.relisshared directly is easier:\nselect relname,relkind,relisshared from pg_class where relisshared is true and relkind=\u0026#39;r\u0026#39;; relname | relkind | relisshared -----------------------+---------+------------- pg_authid | r | t pg_subscription | r | t pg_database | r | t pg_db_role_setting | r | t pg_tablespace | r | t pg_auth_members | r | t pg_shdepend | r | t pg_shdescription | r | t pg_replication_origin | r | t pg_shseclabel | r | t pg_authid stores role/user info. Testing with a password change:\n--Test: on the primary, in a non-business database create user lzl; alter user lzl with password \u0026#39;1\u0026#39;; --run several times CLEAN remxid appears:\nrmgr: Heap len (rec/tot): 76/ 76, tx: 34954264, lsn: 3DB/F808D0F8, prev 3DB/F808D0B8, desc: HOT_UPDATE off 67 xmax 34954264 flags 0x20 ; new off 66 xmax 0, blkref #0: rel 1664/0/1260 blk 0 rmgr: Transaction len (rec/tot): 82/ 82, tx: 34954264, lsn: 3DB/F808D148, prev 3DB/F808D0F8, desc: COMMIT 2025-09-12 14:40:56.680782 CST; inval msgs: catcache 11 catcache 10 rmgr: Heap2 len (rec/tot): 60/ 60, tx: 0, lsn: 3DB/F808D1A0, prev 3DB/F808D148, desc: CLEAN remxid 34954264, blkref #0: rel 1664/0/1260 blk 0 rmgr: Heap2 len (rec/tot): 60/ 60, tx: 34954265, lsn: 3DB/F808D1E0, The standby business database\u0026rsquo;s select 1 query was killed.\nConclusion: Shared tables can cause cross-database query conflicts.\nThat said, these shared system tables rarely see heavy updates in normal operations.\nConclusions # Developer Perspective # Query conflicts can be completely unrelated to the table being queried — meaning a fully static table CAN experience conflicts.\nCross-database means different business domains and data. Cross-database does NOT cause query conflicts. 
The one exception is shared tables, but these are just a handful of system tables that rarely see updates.\nFor developers, focus on:\nRetry on failure: Standby queries can be killed, so retry logic is essential, and a retry may well succeed.\nQuery duration: Longer queries are more likely to be killed.\nAlternative standbys: Consider using a different standby with lower disaster-recovery requirements.\nOperations Perspective # Since query conflicts can come from \u0026ldquo;all directions,\u0026rdquo; a simple long-running single-table query can be killed by in-page pruning on a completely different, frequently-updated table. You can increase max_standby_streaming_delay to reduce the chance that a conflicting query is cancelled.\nHowever, max_standby_streaming_delay trades off against WAL apply: a larger value lets WAL replay stay paused longer while conflicting queries finish. The parameter\u0026rsquo;s value is effectively the maximum replay lag that query conflicts can add on the standby (it does not cap lag caused by the network or other factors).\nQuery freshness: Prolonged WAL apply pauses mean the standby data lags significantly (WAL may already be on the standby\u0026rsquo;s disk), affecting data freshness requirements for other standby queries.\nRTO: If the primary suffers a disaster and failover is needed, the standby must apply accumulated WAL. If apply delay stretches to hours, it may violate minute-level RTO SLAs.\nSo tuning max_standby_streaming_delay is a delicate exercise requiring consideration of the standby\u0026rsquo;s role, query freshness requirements, and even geography.\n","date":"Sep 13, 2025","externalUrl":null,"permalink":"/en/2025/09/13/query-conflicts-from-a-static-table-conflict-to-its-root-cause/","section":"Posts","summary":"Problem Symptoms # The Symptom # A static historical table with no updates whatsoever — yet queries on the same-city standby consistently hit query conflicts:\nERROR: 40001: canceling statement due to conflict with recovery DETAIL: User query might have needed to see row versions that must be removed. 
LOCATION: ProcessInterrupts, postgres.c:3197 Time: 30534.973 ms (00:30.535) Why a Query Conflict on a Static Table Matters # My understanding was that a static table should never experience conflicts (this understanding was wrong — I’ll explain later).\n","title":"Query Conflicts: From a Static Table Conflict to Its Root Cause","type":"posts"},{"content":" PARAMETER_CHANGE and Database Parameters on the Control File # Some PG parameters affect the standby\u0026rsquo;s operation. These parameters are not only in the configuration file but also written to the control file. Whenever parameters change, they are written to WAL and update the control file.\nThe standby redoes the PARAMETER_CHANGE WAL record and writes to the standby\u0026rsquo;s control file. PARAMETER_CHANGE WAL record:\nrmgr: XLOG len (rec/tot): 54/ 54, tx: 0, lsn: 27F/800001C0, prev 27F/80000148, desc: PARAMETER_CHANGE max_connections=3000 max_worker_processes=20 max_wal_senders=10 max_prepared_xacts=0 max_locks_per_xact=1024 wal_level=logical wal_log_hints=off track_commit_timestamp=on XLOG_PARAMETER_CHANGE records these 8 parameters, which can also be viewed directly from the control file:\n$ pg_controldata |grep setting wal_level setting: logical wal_log_hints setting: on max_connections setting: 1000 max_worker_processes setting: 20 max_wal_senders setting: 10 max_prepared_xacts setting: 0 max_locks_per_xact setting: 1024 track_commit_timestamp setting: on These parameters are all from the primary, even if this control file belongs to the standby.\nThe startup process checks 6 parameters via the CheckRequiredParameterValues function. One parameter wal_level must be \u0026gt;= replica. The other 5 parameters — max_connections, max_worker_processes, max_wal_senders, max_prepared_transactions, max_locks_per_transaction — are checked for primary vs standby sizing. If the standby has a smaller value, recovery is paused. 
On PG13 and earlier, raising one of these parameters on the primary alone shuts the hot standby down instead. The PG log:\nFATAL,22023,\u0026#34;hot standby is not possible because max_connections = 2000 is a lower setting than on the master server (its value was 3000)\u0026#34;,,,,,\u0026#34;WAL redo at 27F/800001C0 for XLOG/PARAMETER_CHANGE: max_connections=3000 max_worker_processes=20 max_wal_senders=10 max_prepared_xacts=0 max_locks_per_xact=1024 wal_level=logical wal_log_hints=off track_commit_timestamp=on\u0026#34;,,,,\u0026#34;\u0026#34;,\u0026#34;startup\u0026#34; 6 of the 8 parameters can seriously affect standby operation. The other 2 parameters, wal_log_hints and track_commit_timestamp, are not immediately checked by the startup process. Each of the 8 parameters is synchronized to the control file for its own reason.\nwal_log_hints Primary-Standby Mismatch # Changes to wal_log_hints are recorded in WAL. Although the startup process does not check it, pg_rewind does:\nperform_rewind(...) { ... /* * Target cluster need to use checksums or hint bit wal-logging, this to * prevent from data corruption that could occur because of hint bits. */ if (ControlFile_target.data_checksum_version != PG_DATA_CHECKSUM_VERSION \u0026amp;\u0026amp; !ControlFile_target.wal_log_hints) { pg_fatal(\u0026#34;target server needs to use either data checksums or \\\u0026#34;wal_log_hints = on\\\u0026#34;\u0026#34;); } Since wal_log_hints is WAL-related, it doesn\u0026rsquo;t make sense for pg_rewind to check whether the standby\u0026rsquo;s wal_log_hints is enabled; it should check whether the primary\u0026rsquo;s wal_log_hints is enabled. 
Therefore, PG synchronizes the wal_log_hints parameter to the standby\u0026rsquo;s control file, which is very reasonable.\nwal_log_hints primary-standby mismatch test:\nselect * from t1; checkpoint; update t1 set b=\u0026#39;eee\u0026#39;; -- observation point 1 checkpoint; -- ignore this online checkpoint wal record select * from t1; -- observation point 2 -- observation action pg_waldump 000000020000027F0000000A|tail -10 -- observing option select t_ctid,lp, case lp_flags when 0 then \u0026#39;0:LP_UNUSED\u0026#39; when 1 then \u0026#39;LP_NORMAL\u0026#39; when 2 then \u0026#39;LP_REDIRECT\u0026#39; when 3 then \u0026#39;LP_DEAD\u0026#39; end as lp_flags, t_xmin,t_xmax,t_field3 as t_cid, raw_flags, info.combined_flags,substring(t_data,0,40) from heap_page_items(get_raw_page(\u0026#39;t1\u0026#39;,0)) item, LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2) info order by lp; on, on:\n-- Observation point 1: rmgr: Heap len (rec/tot): 85/ 208, tx: 11140182, lsn: 27F/5000CC38, prev 27F/5000CBC0, desc: HOT_UPDATE off 3 xmax 11140182 flags 0x10 ; new off 4 xmax 0, blkref #0: rel 1663/7472552/7472597 blk 0 FPW rmgr: Transaction len (rec/tot): 46/ 46, tx: 11140182, lsn: 27F/5000CD08, prev 27F/5000CC38, desc: COMMIT 2025-07-21 18:28:13.292397 CST rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 27F/5000CD38, prev 27F/5000CD08, desc: RUNNING_XACTS nextXid 11140183 latestCompletedXid 11140182 oldestRunningXid 11140183 -- Observation point 2: rmgr: XLOG len (rec/tot): 51/ 171, tx: 0, lsn: 27F/58000110, prev 27F/580000D8, desc: FPI_FOR_HINT , blkref #0: rel 1663/7472552/7472597 blk 0 FPW rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 27F/580001C0, prev 27F/58000110, desc: RUNNING_XACTS nextXid 11140183 latestCompletedXid 11140182 oldestRunningXid 11140183 off, off:\n-- Observation point 1: rmgr: Heap len (rec/tot): 85/ 225, tx: 11140183, lsn: 27F/580003C8, prev 27F/58000390, desc: HOT_UPDATE off 4 xmax 11140183 flags 0x10 ; new off 5 xmax 0, blkref #0: rel 
1663/7472552/7472597 blk 0 FPW rmgr: Transaction len (rec/tot): 46/ 46, tx: 11140183, lsn: 27F/580004B0, prev 27F/580003C8, desc: COMMIT 2025-07-21 18:33:18.192146 CST rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 27F/580004E0, prev 27F/580004B0, desc: RUNNING_XACTS nextXid 11140184 latestCompletedXid 11140183 oldestRunningXid 11140184 rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 27F/58000518, prev 27F/580004E0, desc: RUNNING_XACTS nextXid 11140184 latestCompletedXid 11140183 oldestRunningXid 11140184 -- Observation point 2: on, off:\n-- Observation point 1: rmgr: Heap len (rec/tot): 85/ 274, tx: 11140186, lsn: 27F/58000C18, prev 27F/58000BA0, desc: HOT_UPDATE off 7 xmax 11140186 flags 0x10 ; new off 8 xmax 0, blkref #0: rel 1663/7472552/7472597 blk 0 FPW rmgr: Transaction len (rec/tot): 46/ 46, tx: 11140186, lsn: 27F/58000D30, prev 27F/58000C18, desc: COMMIT 2025-07-21 18:40:17.638691 CST rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 27F/58000D60, prev 27F/58000D30, desc: RUNNING_XACTS nextXid 11140187 latestCompletedXid 11140186 oldestRunningXid 11140187 rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 27F/58000D98, prev 27F/58000D60, desc: RUNNING_XACTS nextXid 11140187 latestCompletedXid 11140186 oldestRunningXid 11140187 -- Observation point 2: rmgr: XLOG len (rec/tot): 51/ 236, tx: 0, lsn: 27F/58000E48, prev 27F/58000DD0, desc: FPI_FOR_HINT , blkref #0: rel 1663/7472552/7472597 blk 0 FPW rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 27F/58000F38, prev 27F/58000E48, desc: RUNNING_XACTS nextXid 11140187 latestCompletedXid 11140186 oldestRunningXid 11140187 off, on:\n-- Observation point 1: rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 27F/58001108, prev 27F/58001090, desc: RUNNING_XACTS nextXid 11140187 latestCompletedXid 11140186 oldestRunningXid 11140187 rmgr: Heap len (rec/tot): 85/ 289, tx: 11140187, lsn: 27F/58001140, prev 27F/58001108, desc: HOT_UPDATE off 8 xmax 11140187 flags 0x10 ; new off 9 xmax 0, blkref #0: rel 1663/7472552/7472597 
blk 0 FPW rmgr: Transaction len (rec/tot): 46/ 46, tx: 11140187, lsn: 27F/58001268, prev 27F/58001140, desc: COMMIT 2025-07-21 18:44:08.550109 CST rmgr: Standby len (rec/tot): 54/ 54, tx: 0, lsn: 27F/58001298, prev 27F/58001268, desc: RUNNING_XACTS nextXid 11140188 latestCompletedXid 11140186 oldestRunningXid 11140187; 1 xacts: 11140187 rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 27F/580012D0, prev 27F/58001298, desc: RUNNING_XACTS nextXid 11140188 latestCompletedXid 11140187 oldestRunningXid 11140188 -- Observation point 2: Test summary:\nFPI_FOR_HINT is produced when hint bits are written back; SELECT queries can produce FPI_FOR_HINT. Regardless of the standby setting (on or off), when the primary is on, FPI_FOR_HINT will be produced. Additional Knowledge: What is XLOG_RUNNING_XACTS # XLOG_RUNNING_XACTS is one type of RM_STANDBY_ID:\n/* * XLOG message types */ #define XLOG_STANDBY_LOCK\t0x00 #define XLOG_RUNNING_XACTS\t0x10 #define XLOG_INVALIDATIONS\t0x20 XLOG_STANDBY_LOCK: Records acquisition and release of AccessExclusiveLock, used by standby nodes to recognize lock states.\nXLOG_RUNNING_XACTS: Running-xacts snapshots used for building snapshots to ensure transaction consistency.\nXLOG_INVALIDATIONS: INVALIDATIONS messages for synchronizing metadata information to local backends.\n* standbydefs.h *\tFrontend exposed definitions for hot standby mode. RM_STANDBY_ID is an rmgr specifically defined for hot standby read-only standbys. 
For local instance recovery and logical decoding scenarios that need WAL, RM_STANDBY_ID is essentially meaningless to them.\nObserving WAL records during transaction commit:\ncommand wal record begin; select * from txid_current(); \u0026ndash;11140191 commit; rmgr: Transaction len (rec/tot): 46/ 46, tx: 11140191, lsn: 27F/80000538, prev 27F/80000500, desc: COMMIT 2025-07-23 11:16:10.872724 CST rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 27F/80000568, prev 27F/80000538, desc: RUNNING_XACTS nextXid 11140192 latestCompletedXid 11140191 oldestRunningXid 11140192 The transaction ID itself — commit or abort — is synchronized by rmgr: Transaction. Snapshots are synchronized via rmgr: Standby RUNNING_XACTS.\ntrack_commit_timestamp Primary-Standby Mismatch # track_commit_timestamp: the startup process activates the standby\u0026rsquo;s commit_ts functionality upon receiving the corresponding WAL, primarily for viewing xid commit times on the standby:\n/* * Activate or deactivate CommitTs\u0026#39; upon reception of a XLOG_PARAMETER_CHANGE * XLog record during recovery. */ void CommitTsParameterChange(bool newvalue, bool oldvalue) { /* * If the commit_ts module is disabled in this server and we get word from * the primary server that it is enabled there, activate it so that we can * replay future WAL records involving it; also mark it as active on * pg_control. If the old value was already set, we already did this, so * don\u0026#39;t do anything. * * If the module is disabled in the primary, disable it here too, unless * the module is enabled locally. * * Note this only runs in the recovery process, so an unlocked read is * fine. */ if (newvalue) { if (!commitTsShared-\u0026gt;commitTsActive) ActivateCommitTs(); } else if (commitTsShared-\u0026gt;commitTsActive) DeactivateCommitTs(); } track_commit_timestamp primary-standby mismatch test:\nInitial state: primary=on, standby=on. Both can use committed_xact and similar functions. 
primary=off (restart primary), standby=on (no change). Neither can use pg_last_committed_xact and similar functions. After modifying and restarting the primary, standby replication remains normal, but pg_last_committed_xact and similar functions are unusable on the standby, even though its own setting still shows on:\n$ select * from pg_last_committed_xact(); ERROR: 55000: could not get commit timestamp data HINT: Make sure the configuration parameter \u0026#34;track_commit_timestamp\u0026#34; is set on the primary server. LOCATION: error_commit_ts_disabled, commit_ts.c:385 $ show track_commit_timestamp -\u0026gt; ; track_commit_timestamp ------------------------ on (1 row) Time: 0.198 ms $ \\q ## pg_controldata |grep track_commit_timestamp track_commit_timestamp setting: off PG14+ Pause Recovery # PG14 changed what happens when a parameter change on the primary would previously have shut the standby down. When the standby\u0026rsquo;s values are too low, the hot standby no longer shuts down outright; it pauses recovery instead. See RecoveryRequiresIntParameter.\nPause recovery on a hot standby server if the primary changes its parameters in a way that prevents replay on the standby (Peter Eisentraut)\nPreviously the standby would shut down immediately\nTesting PG14 parameter changes causing standby replication interruption:\n2025-07-23 19:46:31.337 CST,,,141823,,6880ca5f.229ff,14,,2025-07-23 19:41:19 CST,1/0,0,LOG,00000,\u0026#34;recovery has paused\u0026#34;,\u0026#34;If recovery is unpaused, the server will shut down.\u0026#34;,\u0026#34;You can then restart the server after making the necessary configuration changes.\u0026#34;,,,\u0026#34;WAL redo at 281/78324BE8 for XLOG/PARAMETER_CHANGE: max_connections=2000 max_worker_processes=20 max_wal_senders=10 max_prepared_xacts=0 max_locks_per_xact=1024 wal_level=logical wal_log_hints=on track_commit_timestamp=on\u0026#34;,,,,\u0026#34;\u0026#34;,\u0026#34;startup\u0026#34;,,0 Since replication has already stopped, changing the primary\u0026rsquo;s parameters back won\u0026rsquo;t help: the standby 
can\u0026rsquo;t apply subsequent changes and update the control file. So you must modify the standby\u0026rsquo;s parameters and restart (the log hint is also quite clear).\nSummary of the 8 Parameters # When any of the 8 parameters are modified on the primary and the primary is restarted, the local control file is updated. If parameters have changed, the updated parameters are written to WAL and synchronized to downstream. The downstream redoes this PARAMETER_CHANGE WAL record, updating its local control file. The standby then determines whether primary-standby replication or other functions are available based on certain conditions.\n8 Parameters Written to Control File Check If not, standby (PG13-) If not, standby (PG14+) wal_level !=minimal Cannot sync, fundamental Cannot sync, fundamental max_connections primary \u0026lt;= standby hot standby shutdown hot standby pause replication max_worker_processes primary \u0026lt;= standby hot standby shutdown hot standby pause replication max_wal_senders primary \u0026lt;= standby hot standby shutdown hot standby pause replication max_prepared_transactions primary \u0026lt;= standby hot standby shutdown hot standby pause replication max_locks_per_transaction primary \u0026lt;= standby hot standby shutdown hot standby pause replication wal_log_hints pg_rewind prerequisite (either data checksums or wal_log_hints = on) Doesn\u0026rsquo;t affect standby sync Doesn\u0026rsquo;t affect standby sync track_commit_timestamp Enable/disable standby commit_ts functionality Doesn\u0026rsquo;t affect standby sync Doesn\u0026rsquo;t affect standby sync Special thanks to: Gao Changjun\n","date":"Aug 25, 2025","externalUrl":null,"permalink":"/en/2025/08/25/parameters-on-the-control-file-and-primary-standby-parameter-mismatch-issues/","section":"Posts","summary":"PARAMETER_CHANGE and Database Parameters on the Control File # Some PG parameters affect the standby’s operation. 
These parameters are not only in the configuration file but also written to the control file. Whenever parameters change, they are written to WAL and update the control file.\nThe standby redoes the PARAMETER_CHANGE WAL record and writes to the standby’s control file. PARAMETER_CHANGE WAL record:\nrmgr: XLOG len (rec/tot): 54/ 54, tx: 0, lsn: 27F/800001C0, prev 27F/80000148, desc: PARAMETER_CHANGE max_connections=3000 max_worker_processes=20 max_wal_senders=10 max_prepared_xacts=0 max_locks_per_xact=1024 wal_level=logical wal_log_hints=off track_commit_timestamp=on XLOG_PARAMETER_CHANGE records these 8 parameters, which can also be viewed directly from the control file:\n","title":"Parameters on the Control File and Primary-Standby Parameter Mismatch Issues","type":"posts"},{"content":"","date":"Aug 25, 2025","externalUrl":null,"permalink":"/en/categories/postgresql%E6%BA%90%E7%A0%81%E8%A7%A3%E6%9E%90/","section":"Categories","summary":"","title":"PostgreSQL源码解析","type":"categories"},{"content":" Save it, use it freely, no need to ask.\nMay be updated, may not be.\nFeedback welcome — pick it apart if you can.\nThis article was originally published in Chinese on lastdba.com.\n","date":"Jul 19, 2025","externalUrl":null,"permalink":"/en/2025/07/19/postgresql-ddl-pitfalls-and-clever-solutions/","section":"Posts","summary":" Save it, use it freely, no need to ask.\nMay be updated, may not be.\nFeedback welcome — pick it apart if you can.\nThis article was originally published in Chinese on lastdba.com.\n","title":"PostgreSQL DDL Pitfalls and Clever Solutions","type":"posts"},{"content":" Symptoms # The walsender\u0026rsquo;s LSN stopped advancing. 
The stack trace showed it was stuck in pathman\u0026rsquo;s invalidate_psin_entries_using_relid, with the relid constantly changing and the walsender CPU pegged at 100%.\npstack 121327 #0 hash_seq_search (status=status@entry=0x7fffaadf8330) at dynahash.c:1441 #1 0x00002ba3b40ec728 in invalidate_psin_entries_using_relid (relid=relid@entry=42319501) at src/relation_info.c:251 #2 0x00002ba3b40ecb3d in forget_status_of_relation (relid=relid@entry=42319501) at src/relation_info.c:232 #3 0x00002ba3b40fcc96 in pathman_relcache_hook (arg=\u0026lt;optimized out\u0026gt;, relid=42319501) at src/hooks.c:934 #4 0x000000000087168a in LocalExecuteInvalidationMessage (msg=0x3a391c8) at inval.c:595 #5 0x000000000071d50e in ReorderBufferExecuteInvalidations (rb=0x1b63ff8, txn=0x1be5f58, txn=0x1be5f58) at reorderbuffer.c:2238 #6 ReorderBufferCommit (rb=0x1b63ff8, xid=xid@entry=4285897514, commit_lsn=405674661986920, end_lsn=\u0026lt;optimized out\u0026gt;, commit_time=commit_time@entry=799377897828299, origin_id=origin_id@entry=0, origin_lsn=origin_lsn@entry=0) at reorderbuffer.c:1819 #7 0x0000000000712d18 in DecodeCommit (xid=4285897514, parsed=0x7fffaadf8630, buf=0x7fffaadf87f0, ctx=0x1a359e8) at decode.c:637 #8 DecodeXactOp (ctx=0x1a359e8, buf=buf@entry=0x7fffaadf87f0) at decode.c:245 #9 0x00000000007130b2 in LogicalDecodingProcessRecord (ctx=0x1a359e8, record=0x1a35c80) at decode.c:114 #10 0x0000000000733662 in XLogSendLogical () at walsender.c:2885 #11 0x0000000000735942 in WalSndLoop (send_data=send_data@entry=0x733620 \u0026lt;XLogSendLogical\u0026gt;) at walsender.c:2287 #12 0x0000000000736692 in StartLogicalReplication (cmd=0x1846c68) at walsender.c:1213 #13 exec_replication_command (cmd_string=cmd_string@entry=0x181a288 \u0026#34;START_REPLICATION SLOT \\\u0026#34;lzl_logical_rep\\\u0026#34; LOGICAL 170F5/7C3EAE78 (\\\u0026#34;proto_version\\\u0026#34; \u0026#39;1\u0026#39;, \\\u0026#34;publication_names\\\u0026#34; \u0026#39;lzl_logical_rep\u0026#39;)\u0026#34;) at 
walsender.c:1640 #14 0x0000000000774e91 in PostgresMain (argc=\u0026lt;optimized out\u0026gt;, argv=argv@entry=0x1866478, dbname=0x18662b8 \u0026#34;lzldb\u0026#34;, username=\u0026lt;optimized out\u0026gt;) at postgres.c:4325 #15 0x0000000000485989 in BackendRun (port=\u0026lt;optimized out\u0026gt;, port=\u0026lt;optimized out\u0026gt;) at postmaster.c:4526 #16 BackendStartup (port=0x18635b0) at postmaster.c:4210 #17 ServerLoop () at postmaster.c:1739 #18 0x0000000000702f08 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x1814da0) at postmaster.c:1412 #19 0x000000000048660a in main (argc=3, argv=0x1814da0) at main.c:210 ## Second execution, same stack, different relid pstack 121327 #0 hash_seq_search (status=status@entry=0x7fffaadf8330) at dynahash.c:1441 #1 0x00002ba3b40ec728 in invalidate_psin_entries_using_relid (relid=relid@entry=26560221) at src/relation_info.c:251 #2 0x00002ba3b40ecb3d in forget_status_of_relation (relid=relid@entry=26560221) at src/relation_info.c:232 #3 0x00002ba3b40fcc96 in pathman_relcache_hook (arg=\u0026lt;optimized out\u0026gt;, relid=26560221) at src/hooks.c:934 #4 0x000000000087168a in LocalExecuteInvalidationMessage (msg=0x39f1f68) at inval.c:595 ... Analysis # The changing relid showed that the walsender was still running, not dead. The LSN was not advancing, so we analyzed the LSN position to see what the transaction was doing.\nIf the slot information was still available, we could look up the restart LSN via the slot view to find the WAL position. If not, we could use the LSN from the stack trace to identify the WAL log.\nUsing pg_waldump to inspect WAL log entries, filtering by xid:\nrmgr: Heap len (rec/tot): 961/ 961, tx: 4285897514, lsn: 170F5/7DFE3470, prev 170F5/7DFE3430, desc: UPDATE+INIT off 2 xmax 4285897514 flags 0x00 ; new off 1 xmax 0, blkref #0: rel 1663/17662/1259 blk 8443, blkref #1: rel 1663/17662/1259 blk 7327 ... 
rmgr: Transaction len (rec/tot): 1778325/1778325, tx: 4285897514, lsn: 170F5/7E1F4268, prev 170F5/7E1F4220, desc: COMMIT 2025-05-01 09:24:57.828299 CST; inval msgs: catcache 22 catcache 22 catcache 22 catcache 22 catcache 50 catcache 49 catcache 50 catcache 49 catcache 50 catcache 49 catcache 50 catcache 49 catcache 50 catcache 49 catcache 50 ... relcache 48813261 relcache 48813255 relcache 51030741 relcache 48813252 relcache 50737247 relcache 48813246 relcache 48813243 relcache 48813237 relcache 50737241 relcache 48813234 relcache 48813224 relcache 49379811 relcache 48813216 relcache 48813210 relcache 45452775 The transaction for rel 1663/17662/1259 had 180,000 records. The last record was inval msgs: ~70,000 catcache entries and ~30,000 relcache entries.\nrel 1663/17662/1259 is pg_class. Querying by xmin reveals the affected tables and commit time:\nselect xmin,xmax,pg_xact_commit_timestamp(xmin),relname from pg_class where xmin=\u0026#39;4285897514\u0026#39;::xid order by relname desc ; xmin | xmax | pg_xact_commit_timestamp | relname ------------+------+-------------------------------+--------------------------------------------- 4285897514 | 0 | 2025-05-01 09:24:57.828299+08 | v$session 4285897514 | 0 | 2025-05-01 09:24:57.828299+08 | tmp_20230801_id_seq 4285897514 | 0 | 2025-05-01 09:24:57.828299+08 | tmp_20230801 4285897514 | 0 | 2025-05-01 09:24:57.828299+08 | test_param 4285897514 | 0 | 2025-05-01 09:24:57.828299+08 | test_20240105 ... select count(*) from pg_class where xmin=\u0026#39;4285897514\u0026#39;::xid ; count ------- 18523 select count(*) from pg_class ; count -------- 139138 Checking the pglog by timestamp:\n2025-05-01 09:24:59.837 CST,\u0026#34;postgres\u0026#34;,\u0026#34;lzldb\u0026#34;,61418,\u0026#34;[local]\u0026#34;,6812cd65.efea,3,\u0026#34;DO\u0026#34;,2025-05-01 09:24:53 CST,549/0,0,LOG,00000,\u0026#34;duration: 6036.275 ms statement: ... EXECUTE \u0026#39;GRANT SELECT ON ALL TABLES IN SCHEMA public TO r_lzldbdata_qry\u0026#39;; ... 
END; $$\u0026#34;,,,,,,,,,\u0026#34;psql\u0026#34;,\u0026#34;client backend\u0026#34; We can basically confirm that the GRANT operation was the culprit. GRANT updates relacl in pg_class, and at least 18,000 relations had their permissions updated. Updates to pg_class trigger invalidation messages, and the massive number of invalidation messages were being processed slowly in the walsender process.\nReproduction # -- Create a logical replication slot, any kind will do select pg_create_logical_replication_slot(\u0026#39;logical_test\u0026#39;,\u0026#39;test_decoding\u0026#39;); pg_recvlogical -h 127.0.0.1 -p 7997 -d lzldb -U repuser --slot=logical_test --start -f recv.sql \u0026amp; -- Create many tables DO $$ BEGIN FOR i IN 1..20000 LOOP EXECUTE format( \u0026#39;CREATE TABLE IF NOT EXISTS table_%s ( col1 varchar(10) )\u0026#39;, lpad(i::text, 5, \u0026#39;0\u0026#39;) -- Generate 5-digit numbered table names ); END LOOP; END $$; -- Single GRANT grant select on all tables in schema public to r_lzldb_qry; -- Perfectly reproduced postgres@lzlhost:~/lzl/grant]$ pstack 172862 #0 hash_seq_search (status=status@entry=0x7ffd664be280) at dynahash.c:1444 #1 0x00002ad31235e728 in invalidate_psin_entries_using_relid (relid=relid@entry=1002857) at src/relation_info.c:251 #2 0x00002ad31235eb3d in forget_status_of_relation (relid=relid@entry=1002857) at src/relation_info.c:232 #3 0x00002ad31236ec96 in pathman_relcache_hook (arg=\u0026lt;optimized out\u0026gt;, relid=1002857) at src/hooks.c:934 #4 0x000000000087168a in LocalExecuteInvalidationMessage (msg=0x2ad3c3f61a88) at inval.c:595 #5 0x000000000071d50e in ReorderBufferExecuteInvalidations (rb=0x17e5698, txn=0x180d698, txn=0x180d698) at reorderbuffer.c:2238 [postgres@lzlhost:~/lzl/grant]$ pstack 172862 #0 0x0000000000891d0c in hash_seq_search (status=status@entry=0x7ffd664be280) at dynahash.c:1441 #1 0x00002ad31235e728 in invalidate_psin_entries_using_relid (relid=relid@entry=1011110) at src/relation_info.c:251 #2 
0x00002ad31235eb3d in forget_status_of_relation (relid=relid@entry=1011110) at src/relation_info.c:232 #3 0x00002ad31236ec96 in pathman_relcache_hook (arg=\u0026lt;optimized out\u0026gt;, relid=1011110) at src/hooks.c:934 -- relid keeps changing -- CPU pegged at 100%: ps -eo pid,%cpu,%mem|grep 172862 172862 99.3 0.0 -- Takes about 2 hours to catch up Accelerating Walsender by Removing Pathman # Since the database wasn\u0026rsquo;t actually using pathman partitioned tables but had the extension installed, we tried bypassing the pathman hook to speed up walsender processing.\ndrop extension pg_pathman; grant update on all tables in schema public to r_lzldb_upd; [postgres@lzlhost~/lzl/grant]$ pstack 133460 #0 hash_seq_search (status=status@entry=0x7ffe292d5c90) at dynahash.c:1418 #1 0x000000000087f228 in RelfilenodeMapInvalidateCallback (arg=\u0026lt;optimized out\u0026gt;, relid=1034036) at relfilenodemap.c:64 #2 0x000000000087168a in LocalExecuteInvalidationMessage (msg=0x2b9699795768) at inval.c:595 #3 0x000000000071d50e in ReorderBufferExecuteInvalidations (rb=0x195a358, txn=0x1a6ff38, txn=0x1a6ff38) at reorderbuffer.c:2238 #4 ReorderBufferCommit (rb=0x195a358, xid=xid@entry=328684387, commit_lsn=8016890875224, end_lsn=\u0026lt;optimized out\u0026gt;, commit_time=commit_time@entry=799851538975691, origin_id=origin_id@entry=0, origin_lsn=origin_lsn@entry=0) at reorderbuffer.c:1819 ## Completed within 20 seconds Even without commenting out pg_pathman from shared_preload_libraries, there was a dramatic improvement — walsender went from 2 hours to 20 seconds.\nThis seemed odd at first — without commenting shared_preload_libraries, the hook should still run. Source analysis revealed the reason: the very first step of the hook checks for the pathman config table; if it doesn\u0026rsquo;t exist, it skips pathman\u0026rsquo;s invalidation logic entirely, so execution completes quickly:\n/* * Invalidate PartRelationInfo cache entry if needed. 
*/ void pathman_relcache_hook(Datum arg, Oid relid) { Oid pathman_config_relid; /* See cook_partitioning_expression() */ if (!pathman_hooks_enabled) return; if (!IsPathmanReady()) return; ... /* * Invalidation event for PATHMAN_CONFIG table (probably DROP EXTENSION). * Digging catalogs here is expensive and probably illegal, so we take * cached relid. It is possible that we don\u0026#39;t know it atm (e.g. pathman * was disabled). However, in this case caches must have been cleaned * on disable, and there is no DROP-specific additional actions. */ pathman_config_relid = get_pathman_config_relid(true); if (relid == pathman_config_relid) { delay_pathman_shutdown(); } /* Invalidation event for some user table */ else if (relid \u0026gt;= FirstNormalObjectId) { /* Invalidate PartBoundInfo entry if needed */ forget_bounds_of_rel(relid); /* Invalidate PartStatusInfo entry if needed */ forget_status_of_relation(relid); /* Invalidate PartParentInfo entry if needed */ forget_parent_of_partition(relid); } } get_pathman_config_relid fetches the pathman_config table. drop extension pg_pathman removes the pathman_config table from the database, so the source code never enters the forget_* logic.\nThere are other ways to accelerate walsender processing: setting pg_pathman.enable=off causes IsPathmanReady() to return false and bail out immediately. 
Or, most directly, comment out pg_pathman from shared_preload_libraries and restart the instance (this is instance-level, not database-level).\nImprovements in PG14 # PG14.0 release notes:\nAllow logical decoding to more efficiently process cache invalidation messages (Dilip Kumar) This allows logical decoding to work efficiently in presence of a large amount of DDL.\nhttps://www.postgresql.org/docs/release/14.0/\nPatch:\nhttps://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d7eb52d71\nComment from PG14\u0026rsquo;s ReorderBufferAddInvalidations:\nWe require to record it in form of the change so that we can execute only the required invalidations instead of executing all the invalidations on each CommandId increment.\nComparing PG14 vs PG13, ReorderBufferCommit underwent a major rewrite.\nIn PG13, transaction processing logic was directly in the ReorderBufferCommit function:\ncase REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID: Assert(change-\u0026gt;data.command_id != InvalidCommandId); if (command_id \u0026lt; change-\u0026gt;data.command_id) { command_id = change-\u0026gt;data.command_id; if (!snapshot_now-\u0026gt;copied) { /* we don\u0026#39;t use the global one anymore */ snapshot_now = ReorderBufferCopySnap(rb, snapshot_now, txn, command_id); } snapshot_now-\u0026gt;curcid = command_id; TeardownHistoricSnapshot(false); SetupHistoricSnapshot(snapshot_now, txn-\u0026gt;tuplecid_hash); /* * Every time the CommandId is incremented, we could * see new catalog contents, so execute all * invalidations. 
*/ ReorderBufferExecuteInvalidations(rb, txn); } In PG14, the main logic moved to ReorderBufferReplay -\u0026gt; ReorderBufferProcessTXN.\nReorderBufferProcessTXN introduced a new case REORDER_BUFFER_CHANGE_INVALIDATION branch to execute invalidations from the reorder buffer:\ncase REORDER_BUFFER_CHANGE_INVALIDATION: /* Execute the invalidation messages locally */ ReorderBufferExecuteInvalidations( change-\u0026gt;data.inval.ninvalidations, change-\u0026gt;data.inval.invalidations); break; The logic after ReorderBufferExecuteInvalidations is largely the same. The main differences between PG13 and PG14\u0026rsquo;s ReorderBufferCommit:\nReorderBufferCommit is no longer the primary transaction processing function; the call stack is deeper A new case REORDER_BUFFER_CHANGE_INVALIDATION branch was added, separated from REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID, to handle invalidations independently The per-command_id invalidation processing logic was removed Root Cause and Solutions # The root cause of the walsender hang was a bulk GRANT operation that updated many rows in pg_class, triggering a massive number of invalidation messages. A statement like GRANT privs ON ALL TABLES IN SCHEMA public TO role1 executes as multiple commands within a single transaction in PostgreSQL. In PG13, logical replication processes invalidation messages per-command, invoking each hook\u0026rsquo;s inval hash table processing. In this scenario, pathman\u0026rsquo;s hook was particularly slow at processing the inval hash table, causing replication lag.\nConditions for pathman-induced slowness (all must apply):\nPG13 or earlier Bulk GRANT pathman extension installed (whether used or not) Logical replication slot active Even after removing pathman, significant CPU time was still spent in functions like RelfilenodeMapInvalidateCallback. In PG13 testing, the processing time difference between with and without pathman was hours vs. 
minutes.\nOther untested but community-mentioned scenarios (all must apply):\nPG13 or earlier Bulk DDL / TRUNCATE / DCL / DROP PUBLICATION Logical replication slot active Short-term fix: If pathman tables are not in use, drop the extension or unload the pathman shared library; restart the replication slot.\nLong-term fix: Upgrade to PG14+ (tested — extremely fast with no lag).\n# References # https://www.postgresql.org/message-id/flat/17716-1fe42e7b44fc2f25%40postgresql.org\nhttps://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d7eb52d71\n","date":"Jun 26, 2025","externalUrl":null,"permalink":"/en/2025/06/26/case-grant-authorization-causes-walsender-to-hang/","section":"Posts","summary":"Symptoms # The walsender’s LSN stopped advancing. The stack trace showed it was stuck in pathman’s invalidate_psin_entries_using_relid, with the relid constantly changing and the walsender CPU pegged at 100%.\npstack 121327 #0 hash_seq_search (status=status@entry=0x7fffaadf8330) at dynahash.c:1441 #1 0x00002ba3b40ec728 in invalidate_psin_entries_using_relid (relid=relid@entry=42319501) at src/relation_info.c:251 #2 0x00002ba3b40ecb3d in forget_status_of_relation (relid=relid@entry=42319501) at src/relation_info.c:232 #3 0x00002ba3b40fcc96 in pathman_relcache_hook (arg=\u003coptimized out\u003e, relid=42319501) at src/hooks.c:934 #4 0x000000000087168a in LocalExecuteInvalidationMessage (msg=0x3a391c8) at inval.c:595 #5 0x000000000071d50e in ReorderBufferExecuteInvalidations (rb=0x1b63ff8, txn=0x1be5f58, txn=0x1be5f58) at reorderbuffer.c:2238 #6 ReorderBufferCommit (rb=0x1b63ff8, xid=xid@entry=4285897514, commit_lsn=405674661986920, end_lsn=\u003coptimized out\u003e, commit_time=commit_time@entry=799377897828299, origin_id=origin_id@entry=0, origin_lsn=origin_lsn@entry=0) at reorderbuffer.c:1819 #7 0x0000000000712d18 in DecodeCommit (xid=4285897514, parsed=0x7fffaadf8630, buf=0x7fffaadf87f0, ctx=0x1a359e8) at decode.c:637 #8 DecodeXactOp (ctx=0x1a359e8, 
buf=buf@entry=0x7fffaadf87f0) at decode.c:245 #9 0x00000000007130b2 in LogicalDecodingProcessRecord (ctx=0x1a359e8, record=0x1a35c80) at decode.c:114 #10 0x0000000000733662 in XLogSendLogical () at walsender.c:2885 #11 0x0000000000735942 in WalSndLoop (send_data=send_data@entry=0x733620 \u003cXLogSendLogical\u003e) at walsender.c:2287 #12 0x0000000000736692 in StartLogicalReplication (cmd=0x1846c68) at walsender.c:1213 #13 exec_replication_command (cmd_string=cmd_string@entry=0x181a288 \"START_REPLICATION SLOT \\\"lzl_logical_rep\\\" LOGICAL 170F5/7C3EAE78 (\\\"proto_version\\\" '1', \\\"publication_names\\\" 'lzl_logical_rep')\") at walsender.c:1640 #14 0x0000000000774e91 in PostgresMain (argc=\u003coptimized out\u003e, argv=argv@entry=0x1866478, dbname=0x18662b8 \"lzldb\", username=\u003coptimized out\u003e) at postgres.c:4325 #15 0x0000000000485989 in BackendRun (port=\u003coptimized out\u003e, port=\u003coptimized out\u003e) at postmaster.c:4526 #16 BackendStartup (port=0x18635b0) at postmaster.c:4210 #17 ServerLoop () at postmaster.c:1739 #18 0x0000000000702f08 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x1814da0) at postmaster.c:1412 #19 0x000000000048660a in main (argc=3, argv=0x1814da0) at main.c:210 ## Second execution, same stack, different relid pstack 121327 #0 hash_seq_search (status=status@entry=0x7fffaadf8330) at dynahash.c:1441 #1 0x00002ba3b40ec728 in invalidate_psin_entries_using_relid (relid=relid@entry=26560221) at src/relation_info.c:251 #2 0x00002ba3b40ecb3d in forget_status_of_relation (relid=relid@entry=26560221) at src/relation_info.c:232 #3 0x00002ba3b40fcc96 in pathman_relcache_hook (arg=\u003coptimized out\u003e, relid=26560221) at src/hooks.c:934 #4 0x000000000087168a in LocalExecuteInvalidationMessage (msg=0x39f1f68) at inval.c:595 ... Analysis # The changing relid showed that the walsender was still running, not dead. 
The LSN was not advancing, so we analyzed the LSN position to see what the transaction was doing.\n","title":"Case: GRANT Authorization Causes Walsender to Hang","type":"posts"},{"content":"(For memory basics, refer to Linux Memory Analysis; this article covers memory knowledge above that foundation)\nMemory Basic Concepts # buddy # The process of buddy system allocating and merging pages is omitted.\nEasily overlooked knowledge points:\nThe prerequisite for buddy merging two blocks of the same size is that their physical addresses are contiguous The merge algorithm is iterative: after merging at the current level, it will automatically attempt to merge larger blocks. This means compactd is not strictly required for merging page table \u0026amp; PTE # page table and PTE are actually two different concepts, and they are easily confused because both generally refer to page tables. Below is relevant knowledge about page table and PTE[^ 《深入理解Linux内核》 (Understanding the Linux Kernel)]\nPTE stores the physical address of the page frame \u0026ldquo;page table\u0026rdquo; and \u0026ldquo;Page Table\u0026rdquo; are different concepts: \u0026ldquo;page table\u0026rdquo; refers to the pages that maintain the mapping between linear addresses and physical addresses, while \u0026ldquo;Page Table\u0026rdquo; refers to pages in the upper-level page table pte_t, pmd_t, pud_t, pgd_t describe Page Table Entry, Page Middle Directory entry, Page Upper Directory entry, and Page Global Directory entry respectively PTE is Page Table Entry If you only look at the size of the pagetable used by the MMU to cache virtual-to-physical memory mappings, confusing pagetable with PTE doesn\u0026rsquo;t make much difference. However, if you need to go deep into page table directories, you need to separate the two concepts.\nTLB # Each level of the page table is stored in memory. 
To complete a single virtual-to-physical address translation, all four page tables corresponding to the current virtual address must be found. This means a single memory IO requires looking up the page table in memory 4 times just for virtual-to-physical address translation. Translation Lookaside Buffers (TLB) are caches specifically designed to accelerate virtual-to-physical address translation.\nRegarding the TLB\u0026rsquo;s location, it is usually in the L1 cache (some say it\u0026rsquo;s in registers or L2, which likely depends on the CPU architecture; for now, just consider it as CPU cache, distinct from main memory)1:\nIn modern processors, the L1 cache is typically divided into multiple parts, including data cache dTLB and instruction cache iTLB. Frequently modifying page tables leads to increased main memory accesses, causing the CPU to frequently flush the TLB cache[^ 《深入理解Linux内核》 (Understanding the Linux Kernel)]. The TLB also has a finite size; improving TLB hit rate can reduce accesses to the main memory pagetable. Using huge pages can reduce PTEs by three orders of magnitude, greatly reducing TLB misses.[^ 《深入理解Linux进程和内存》 (Understanding Linux Processes and Memory)].\nTLB information:\n#cpuid -l L1 TLB/cache information: 2M/4M pages \u0026amp; L1 TLB (0x80000005/eax): L1 TLB/cache information: 4K pages \u0026amp; L1 TLB (0x80000005/ebx): ... L2 TLB/cache information: 2M/4M pages \u0026amp; L2 TLB (0x80000006/eax): Observing TLB hit rate:\nperf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses -I 10000 -p $PM_PID During memory reclamation, TLB misses do increase, but it\u0026rsquo;s hard to establish a causal relationship. TLB miss is just one observation metric for the MMU — TLB is part of MMU.\nReverse Mapping # The general principles of PFRA (Page Frame Reclaiming Algorithm)[^ 《深入理解Linux内核》 (Understanding the Linux Kernel)]:\nFirst, release \u0026ldquo;harmless\u0026rdquo; pages. 
Start by reclaiming harmless pages in the pagecache, that is, pages not occupied by any process All pages of user-mode processes are candidates for reclamation. PFRA will gradually deprive user-mode pages with longer sleep times of their page frames Cancel the mapping of all page table entries for a shared page frame, then reclaim that shared page frame Only reclaim \u0026ldquo;unused\u0026rdquo; pages One of PFRA\u0026rsquo;s goals is to be able to release shared page frames. The process of quickly locating all page table entries pointing to the same page frame is called reverse mapping.\nReverse mappings for shared\nAnonymous pages File-mapping pages Basic tricks of page frame reclaiming\nLRU lists Free cheapest pages first Unmap all at once Etc2 Huge Pages # Enabling huge pages improves performance for specific application workloads. In PostgreSQL, enabling huge pages on large-memory instances also offers some performance gains and even some stability benefits.\nWhy are huge pages better?3:\nReduced TLB pressure Reduced pagetable size in main memory Huge pages are physically contiguous. Contiguous physical memory access is better than non-contiguous physical memory access When using these kinds of larger pages, higher level pages can directly map them, with no need to use lower level page entries[^ kernel.org,mm pagetables] However, using huge pages brings management challenges:\nHuge pages need to be pre-allocated Huge page size must be calculated in advance to avoid memory waste Two ways for processes to use huge pages:\nThe first is to use shmget() to set up a shared region backed by huge pages The second is to call mmap() on a file opened in the huge page filesystem C Library and System Calls # The middle layer between kernel space and user space is the system call layer. Application Programming Interfaces (APIs) and system calls are different. 
Applications are written against APIs implemented in user space, rather than directly executing system calls. In the UNIX world, the most common system call layer is the POSIX standard (Portable Operating System Interface of UNIX). The POSIX standard targets APIs, not system calls. The Linux operating system\u0026rsquo;s API is typically provided in the form of C standard libraries, such as libc. The C standard library provides implementations for most POSIX APIs.[^《奔跑吧 Linux内核 入门篇（第2版）》 (Running Linux Kernel: Beginner\u0026rsquo;s Guide 2nd Edition)]\nC app-\u0026gt;C lib-\u0026gt;system calls-\u0026gt;OS-\u0026gt;hardware4:\nCommon C library and system calls:\nmalloc, free=\u0026gt;C lib\nmmap, brk, munmap=\u0026gt;system calls\nPage Fault Exception # Page fault exceptions (or page fault interrupts) fall into two cases: exceptions caused by programming errors; and physical page allocation behavior triggered by using virtual address space where physical page frames haven\u0026rsquo;t been allocated yet.[^ 《深入理解Linux内核》 (Understanding the Linux Kernel)]\nExceptional page fault (segmentation fault): each virtual memory area has associated permissions. If a process accesses a memory area outside its valid range, or illegally accesses a memory area, or accesses a memory area in an incorrect manner, the processor reports a page fault exception. In severe cases, it reports a \u0026ldquo;segmentation fault\u0026rdquo; and terminates the process[^《奔跑吧 Linux内核 入门篇（第2版）》 (Running Linux Kernel: Beginner\u0026rsquo;s Guide 2nd Edition)].\nNormal page fault: System calls like mmap and brk manage virtual memory; they don\u0026rsquo;t directly allocate physical memory. Virtual memory system call functions only establish the process address space. Virtual memory is visible in user space, but no mapping between virtual memory and physical memory has been established. 
When a process accesses virtual memory where no mapping has been established, a page fault interrupt is triggered.[^《奔跑吧 Linux内核 入门篇（第2版）》 (Running Linux Kernel: Beginner\u0026rsquo;s Guide 2nd Edition)]\nPage faults are also divided into two types:\nminor fault: the page fault was handled without blocking the current process, and a page frame was allocated\nmajor fault: the page fault forced the current process to sleep (likely because filling the page frame with data from disk took time). A page fault that blocks the current process is a major fault[^ 《深入理解Linux内核》 (Understanding the Linux Kernel)]\nCopy-On-Write (COW) # When the fork system call is executed, the child process and parent process have independent process address spaces but share physical memory resources, including process context, process stack, memory information, file descriptors, directories, resource limits, etc. Only the parent process\u0026rsquo;s page table needs to be copied to the child process. At this point, sharing is read-only. 
When writing is needed (when running their respective tasks), data is copied, giving the parent and child processes their own copies.[^《奔跑吧 Linux内核 入门篇（第2版）》 (Running Linux Kernel: Beginner\u0026rsquo;s Guide 2nd Edition)]\nFor PostgreSQL\u0026rsquo;s multi-process model, fork itself isn\u0026rsquo;t heavy — you may only need to worry about page tables — but the various tasks that come after fork will trigger copy-on-write to create the child process\u0026rsquo;s own resource copies.\nNote the distinction between copy-on-write and page fault exceptions: copy-on-write refers to resources not being allocated to the child process at fork time; page fault exceptions refer to physical memory allocation occurring for this process, unrelated to fork.\nmmap, brk \u0026amp; Shared Memory Mapping Area, Heap Area # The functions and memory address regions used by mmap and brk are different:\nmmap is used to manage shared memory, corresponding to the shared memory mapping area brk is used to manage private memory, corresponding to the heap area Linear address region functions:\nmmap: The mapping area expands top-down. The mmap mapping area and heap expand toward each other until they exhaust the remaining space in the virtual address space. This structure facilitates the C runtime library\u0026rsquo;s use of the mmap mapping area and heap for memory allocation. Stack: Stores local variables and function parameters during program execution, grows from high addresses to low addresses Heap: Dynamic memory allocation area, managed through functions like malloc, new, free, and delete BSS (Uninitialized Variables): Stores uninitialized global variables and static variables Data: Stores global variables and static variables with predefined values in source code Text (Code): Stores read-only program execution code, i.e., machine instructions. 
Shared memory mapping area and heap area5:\nReal postmaster heap and shared memory mapping:\ncat /proc/1063005/smaps |grep -E \u0026#34;\\-s|heap\u0026#34; 022a4000-022ee000 rw-p 00000000 00:00 0 [heap] 7fef6019e000-7fef601a5000 rw-s 00000000 00:17 21 /dev/shm/PostgreSQL.1291978332 7fef601a5000-7fef6098b000 rw-s 00000000 00:01 1052 /dev/zero (deleted) #this is shared buffers 7fef6e238000-7fef6e239000 rw-s 00000000 00:01 10 /SYSV0011f702 (deleted) You can see the heap and shared memory area addresses roughly match.\nVM # Linux kernel virtual memory subsystem\nDirectory: cd /proc/sys/vm/\ncompact # concept \u0026amp; param # Memory compaction is a mechanism in the Linux kernel for solving memory fragmentation problems. It improves the allocation and compaction efficiency of large contiguous memory pages by merging free physical pages.\nParameter Function Default/Range compact_memory Manually trigger a global memory compaction operation Write 1 to trigger compaction_proactiveness Controls the frequency of proactive compaction Parameter available since 4.x. 0-100 (default 20) compact_unevictable_allowed Whether to allow compaction of unreclaimable pages (e.g., mlock locked memory) Parameter available since 4.x. 0 (disable) or 1 (allow) defrag_mode Controls the trigger strategy for memory defragmentation Parameter available since 4.x. 0-3. 0 disables automatic compaction; 1 defers passive compaction. Default in 3.10 is 1 extfrag_threshold Threshold for triggering compaction when large memory blocks are insufficient 0-1000 (default 500) There are 3 compaction modes (depending on kernel version support):\nPassive compaction: extfrag_threshold addresses \u0026ldquo;already occurred\u0026rdquo; fragmentation problems — triggered when a process requests large memory blocks and finds them insufficient. 
Proactive compaction: compaction_proactiveness proactively controls compaction aggressiveness, optimizing \u0026ldquo;not yet occurred\u0026rdquo; but potential fragmentation risks. Manual compaction: compact_memory. extfrag_threshold is the Linux kernel parameter controlling passive compaction. When the kernel fails to allocate high-order contiguous physical memory (e.g., huge pages), it determines the failure cause via the fragmentation index:\n-1: Allocation succeeded (watermark satisfied) 0: Failed due to insufficient memory 1000: Failed due to fragmentation View specific values via /sys/kernel/debug/extfrag/extfrag_index. The output is a floating-point number (e.g., 0.854), but the actual range is magnified 1000x, so 0.854 corresponds to an actual value of 854:\ncat /sys/kernel/debug/extfrag/extfrag_index |grep Normal Node 0, zone Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 0.995 0.998 If extfrag_threshold=600, compaction is triggered when the fragmentation index \u0026gt; 600. 
extfrag_index is quite useful and can assist buddy in observing fragmentation issues.\ndirty # concept \u0026amp; param # Dirty page flushing is somewhat similar to memory reclamation and is also divided into asynchronous and synchronous:\nAsynchronous flushing: performed by background threads like pdflush/flush/kdmflush; application writes are not affected Synchronous flushing: directly blocks the application process; the process that initiated the write operation flushes the dirty pages itself Parameter Name Description Default dirty_background_bytes Background async flush threshold, in bytes 0 (disabled) dirty_background_ratio Background async flush threshold, as percentage 10% dirty_bytes Synchronous flush threshold, in bytes 0 (disabled) dirty_ratio Synchronous flush threshold, as percentage 20-40% dirty_expire_centisecs Maximum lifetime of dirty pages in memory 3000 (30s) dirty_writeback_centisecs Frequency of kernel periodic dirty page state checks 500 (5s) xxx_bytes and xxx_ratio parameters are mutually exclusive.\nExample parameters and flowchart:\ndirty_background_bytes 0 dirty_background_ratio 10 dirty_bytes 0 dirty_ratio 40 dirty_expire_centisecs 3000 dirty_writeback_centisecs 500 %% Dirty page flushing flow diagram integrating time parameters graph TD A[App writes generate dirty pages] --\u0026gt; B{Check interval reached?\u0026lt;br\u0026gt;dirty_writeback_centisecs every 5s} B -- No --\u0026gt; D{Expired dirty pages exist?\u0026lt;br\u0026gt; dirty_expire_centisecs\u0026gt;30s} B -- Yes --\u0026gt; C{Dirty page threshold check} C --\u0026gt; E[Dirty page ratio? dirty_background_ratio\u0026gt;10% ] C --\u0026gt; F[Dirty page ratio? 
dirty_ratio\u0026gt; 40%] E -- Trigger --\u0026gt; G[Background async flush] F -- Trigger --\u0026gt; H[Synchronous flush] D -- Trigger --\u0026gt; G G --\u0026gt; I[Dirty pages written to disk] H --\u0026gt; I[Dirty pages written to disk] I --\u0026gt; J[Free memory space] The configuration principles for dirty page flush parameters are basically the same as PostgreSQL dirty page flush parameters. Setting them too low causes overly frequent flushing — the same dirty page may be written to disk multiple times, wasting IO. Setting them too high may cause IO storms.\nObserving Dirty Pages # Monitoring dirty pages:\nps -eo pid,%cpu,%mem,wchan,args,state|grep kdmflush|grep -E -w -v \u0026#34;S\u0026#34; #Observe async flush process state cat /proc/vmstat| grep -E -w \u0026#34;nr_dirty|nr_writeback\u0026#34; #vmstat dirty, page count cat /proc/meminfo |grep -i dirty #meminfo dirty, KB Testing dirty pages with dd:\ngrep -E \u0026#34;nr_dirty_threshold|nr_dirty_background_threshold\u0026#34; /proc/vmstat | awk \u0026#39;{printf \u0026#34;%s: %.2fGB\\n\u0026#34;, $1, ($2*4)/1048576}\u0026#39; nr_dirty_threshold: 141.28GB nr_dirty_background_threshold: 35.32GB dd if=/dev/zero of=testfile bs=8k count=128000 # cache io Failed test (same result after multiple tests):\nNo RUNNING kdmflush process observed Dirty pages were flushed before reaching 35GB or 30S threshold Timestamp nr_dirty nr_dirty(GB) Trend Simulation 17:00:18 2,757 0.01052 ▍ 17:00:19 336,199 1.282 ████▌ 17:00:25 1,984,867 7.574 ██████████████▍ 17:00:32 4,252,177 16.22 ████████████████████ 17:00:33 3,699,227 14.11 █████████████████▊ 17:00:38 170,865 0.652 ▎ 17:00:46 2,865,814 10.93 █████████▋ 17:00:54 4,721,827 18.01 ██████████████████████ 17:00:55 3,876,509 14.79 ██████████████████ 17:01:03 835,097 3.186 ██▊ os dirty != pg dirty # With pg fsync=on, data writes go through the OS pagecache before specific blocks are written to disk. PostgreSQL has its own dirty pages, and the OS also has dirty pages. 
What\u0026rsquo;s the relationship between the two?\n## Observation commands cat /proc/meminfo |grep -E -w \u0026#34;Dirty\u0026#34; # OS dirty pages select isdirty,pinning_backends,count(*) from pg_buffercache where isdirty is true group by isdirty,pinning_backends; # PG dirty pages checkpoint; begin; --Observe insert into tlzl select generate_series(1,1000000); --Observe commit; --Observe checkpoint; --Observe Test results:\nstage dirty in pg OS dirty Clean state 0 0.02-2M fluctuating After insert completion 200M Rose to 1.7G, then dropped to 20KB After commit 200M 0.02-2M fluctuating After checkpoint flush 0 0.02-2M fluctuating When the insert data size is increased, OS dirty rises during insert, rising to the GB level and then fluctuating.\nPG dirty has some relation to OS dirty but they\u0026rsquo;re not entirely correlated. When PG inserts data, OS dirty does rise, but after the OS flushes its own dirty pages, PG\u0026rsquo;s dirty pages remain dirty. Preliminary judgment: dirty pages in shared memory are unrelated to OS dirty. It\u0026rsquo;s yet to be determined whether the OS dirty increase comes from PG\u0026rsquo;s private memory dirty pages.\nswappiness # Controls the kernel\u0026rsquo;s bias toward reclaiming memory from the anonymous memory pool or the page cache. Essentially, it controls whether swapping anonymous pages or reclaiming file pages imposes a lower cost for the upper-layer application. For example, for compute-oriented applications using more dynamic allocation or private memory, a lower swappiness should be set; for data-dependent applications, a higher swappiness should be set to reduce the impact of flushing file pages on data access. However, all of this depends on the efficiency of swap IO and filesystem IO6. It all sounds ideal, but when swapping occurs, it very likely means performance degradation.\nswappiness=0 # When swappiness=0, the kernel will only swap when memory reaches the high watermark7. 
The specific strategy also relates to the kernel version and NUMA. What can be confirmed is that swappiness=0 does not mean swap is disabled; swapoff -a is what disables the swap functionality.\n#Check if swap is enabled swapon --show free -h |grep Swap cat /proc/swaps grep -E \u0026#39;swap|none\u0026#39; /etc/fstab cat /proc/meminfo|grep Swap #Monitor whether swapping is occurring cat /proc/vmstat|grep swp sar -W 1 inconsistent swap behavior # The OS-level /proc/sys/vm/swappiness has little-to-no effect on the swap behavior of cgroups v1 systems. This issue can lead to inconsistent swap behavior8.\nOccurrence conditions (all must be true):\nvm.swappiness != cgroups memory.swappiness cgroups v1 Cause:\nsystemd creates cgroups early during startup, before sysctl.service loads /etc/sysctl.conf. vm.swappiness cannot constrain cgroup memory.swappiness. The issue is: when the OS swap behavior and cgroup behavior differ, which one takes effect?\nSolutions:\nFor cgroup v1, set vm.swappiness = all cgroups memory.swappiness For cgroup v1, many solutions available, see https://access.redhat.com/solutions/6785021 Use cgroup v2. v2 adds the vm.force_cgroup_v2_swappiness parameter, which disables cgroup\u0026rsquo;s memory.swappiness memory overcommitment # concept \u0026amp; param # Linux does not reserve physical memory for every virtual address; instead, it allocates memory only when actually needed. The overcommit policy can limit the total virtual memory that all processes may request. Requesting more memory than the physical memory available is called overcommit.\nThere are three overcommit policy parameters: overcommit_memory, overcommit_ratio/overcommit_kbytes\nThe overcommit_memory parameter controls the overcommitment policy:\n0 (default): Heuristic overcommitment policy, allows slight overcommit. CommitLimit = physical memory + swap. 
1: No overcommit check 2: Strict limit, prohibits exceeding CommitLimit graph TD A[Memory allocation request] --\u0026gt; B{Overcommit mode} B --\u0026gt;|Mode 0: Heuristic| C[\u0026#34;Allow moderate virtual memory overcommit\u0026#34;] B --\u0026gt;|Mode 1: Unlimited| D[\u0026#34;Virtual memory commits unconstrained\u0026#34;] B --\u0026gt;|Mode 2: Strict| E[\u0026#34;Virtual memory total ≤ CommitLimit\u0026#34;] C --\u0026gt; F[Allocate physical pages on demand at runtime] D --\u0026gt; G[May exhaust physical memory + Swap] E --\u0026gt; H[Enforce virtual memory total control] When overcommit_memory=2, only one of the overcommit_ratio and overcommit_kbytes parameters takes effect. The CommitLimit is calculated as follows: $$ \\text{CommitLimit} = (\\text{RAM} - \\text{huge page memory}) \\times \\frac{\\text{overcommit\\_ratio}}{100} + \\text{SwapTotal} $$ or, when overcommit_kbytes is set (it replaces the ratio term entirely), $$ \\text{CommitLimit} = \\text{overcommit\\_kbytes} + \\text{SwapTotal} $$ The overcommit accounting9 is worth noting: mmap, brk, and fork are all accounted for, which clearly affects PostgreSQL.\nStatus ------ o\tWe account mmap memory mappings o\tWe account mprotect changes in commit o\tWe account mremap changes in size o\tWe account brk o\tWe account munmap o\tWe report the commit status in /proc o\tAccount and check on fork o\tReview stack handling/building on exec o\tSHMfs accounting o\tImplement actual limit enforcement Reserve Memory and Overcommit # user_reserve_kbytes: When overcommit_memory=2, physical memory reserved for ordinary user processes. When system memory is severely insufficient, it ensures ordinary users can still perform basic operations (like starting new processes, handling memory allocation requests). Default is min(3% of the current process size, 128M). 
When set to 0, a single process can allocate (all free memory - admin_reserve_kbytes)\nadmin_reserve_kbytes: Physical memory reserved for users with CAP_SYS_ADMIN privileges (typically the root user), so that the system administrator can still log in and run recovery commands. Default is min(3% of free pages, 8MB). When using strict overcommit mode, it\u0026rsquo;s best to increase this parameter.\n$ cat user_reserve_kbytes 131072 $ cat admin_reserve_kbytes 8192 Observing Overcommit # grep -E \u0026#39;CommitLimit|Committed_AS\u0026#39; /proc/meminfo sar -r 1 $ grep -E \u0026#39;CommitLimit|Committed_AS\u0026#39; /proc/meminfo CommitLimit: 203103492 kB Committed_AS: 252170700 kB $ sar -r 1 07:32:35 PM kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty 07:32:37 PM 25472180 370249056 93.56 14588 274485956 252242936 62.91 233866528 103568816 12924 07:32:38 PM 25471904 370249332 93.56 14588 274487888 252242740 62.91 233851748 103570136 11180 Metric meanings:\nmeminfo CommitLimit: CommitLimit calculated from physical memory, Swap, and overcommit parameters meminfo Committed_AS: Total virtual memory currently requested by all processes sar -r kbcommit = Committed_AS sar -r %commit = kbcommit / (total physical memory + swap) smaps or status can also show total requested virtual memory, but directly summing smaps/status total virtual memory double-counts shared library files and mapped files (like mmap), while Committed_AS only counts memory requested via mmap, brk, fork, etc., and does not double-count shared memory. The two have different calculation scopes. 
For total virtual memory, just look at Committed_AS or kbcommit.\nwatermark # Parameter Name Description Introduced Default Unit/Range min_free_kbytes Defines the minimum free memory the system reserves, directly affecting the watermarks watermark[min] calculation, ensuring the system retains enough memory for critical operations when memory is tight Early kernel versions KB watermark_scale_factor Globally adjusts the memory watermark gap (high-low and low-min) Linux kernel 4.x (exact minor version unknown) 10 (0.1% physical memory) Max 3000 (30% physical memory) watermark_boost_factor Temporarily raises the high watermark (high), triggering aggressive memory reclamation to reduce fragmentation Linux kernel 4.x (exact minor version unknown) 15000 (i.e., 1.5x original high watermark) min_free_kbytes # ## Calculate total min and other values from zoneinfo cat /proc/zoneinfo | grep -E -w \u0026#34;min|low|high\u0026#34;|grep -E -v \u0026#34;high:\u0026#34;| awk \u0026#39; /min/ { total_min += $2 } /low/ { total_low += $2 } /high/ { total_high += $2 } END { printf \u0026#34;Total min: %d KB\\nTotal low: %d KB\\nTotal high: %d KB\\n\u0026#34;, total_min * 4, total_low * 4, total_high * 4; }\u0026#39; Total min: 15828844 KB Total low: 19786048 KB Total high: 23743260 KB #Current system min value cat min_free_kbytes 15828849 Because there are other zones, the total min across all zones is approximately equal to min_free_kbytes. The Normal zone\u0026rsquo;s min is definitely slightly smaller than min_free_kbytes; you only need to focus on the Normal zone:\n## Normal zone min, low, high settings; page=4k cat /proc/zoneinfo | grep -A 50 Normal | grep -E \u0026#34;min|low|high\u0026#34; min 3931615 low 4914518 high 5897422 Before Linux kernel 4.6, min, low, and high had a fixed ratio, and you could only change low and high values by setting min_free_kbytes. 
min:low:high = 1:1.25:1.5.\nProblems with the fixed ratio:\nIdeally, you\u0026rsquo;d want to raise low to more proactively trigger kswapd async reclamation and lower min to reduce direct reclaim. Before 4.6, you could only indirectly adjust low/high by adjusting min, using min to adjust kswapd\u0026rsquo;s delta working buffer. For example:\nkswapd async reclamation working buffer (low-min) kswapd async reclamation workload (high-low) min=1GB, low=1.25GB, high=1.5GB 0.25GB 0.25GB min=10GB, low=12.5GB, high=15GB 2.5GB 2.5GB In other words, before 4.6 the only reason to raise min was to raise low and high.\nAn excessively low min value leaves kswapd too little time to asynchronously reclaim memory before direct reclaim triggers. An excessively high min not only wastes memory but also causes more frequent reclamation activity, resulting in higher sys CPU usage. The default difference between low and min in Linux indeed seems a bit small.\nwatermark_scale_factor # Wouldn\u0026rsquo;t it be great if you could directly adjust min, low, and high? Sorry, the Linux kernel doesn\u0026rsquo;t support that (Android has extra_free_kbytes). But\u0026hellip;\nLinux kernel 4.6 added the watermark_scale_factor parameter, which adjusts the gaps between the watermarks, so the ratio is no longer fixed. Its default value is 10, corresponding to 0.1% of memory (10/10000), with a maximum of 3000. When set to 1000, it means the difference between \u0026ldquo;low\u0026rdquo; and \u0026ldquo;min\u0026rdquo;, and between \u0026ldquo;high\u0026rdquo; and \u0026ldquo;low\u0026rdquo;, will both be 10% of memory size (1000/10000).\n0.1% is clearly too small: for 1TB of memory, the gap is only 1GB.\nwatermark_boost_factor # watermark_boost_factor is used to reduce external memory fragmentation. It temporarily raises the zone\u0026rsquo;s watermark, i.e., zone-\u0026gt;watermark_boost, thereby raising the zone\u0026rsquo;s high watermark (WMARK_HIGH). 
This allows kswapd to reclaim more memory, making it easier for the memory compaction module (kcompactd kernel thread) to merge large blocks of contiguous physical memory. The default value of watermark_boost_factor is 15000, meaning the temporary boost can be as large as 150% of the original high watermark. Setting this to 0 disables the mechanism for temporarily raising zone watermarks10\noom # The OOM Killer is a kernel mechanism, not a process.\nParameter Name Description Default panic_on_oom Controls system behavior when OOM occurs: 0: Don\u0026rsquo;t trigger panic, start OOM Killer 1: Trigger panic, except when the OOM is confined to a cpuset or memory policy, in which case the OOM Killer runs instead 2: Always trigger panic, even for cpuset, memory policy, or memory cgroup OOM 0 oom_kill_allocating_task Whether to preferentially kill the process that triggered OOM (rather than traversing the process tree to select the optimal target): 0: Disabled 1: Enabled 0 oom_dump_tasks Whether to dump all task information when OOM occurs (for post-mortem analysis): 0: Disabled 1: Enabled 1 oom_score # When OOM occurs, the system needs to decide which process to kill based on the OOM score. Each user process has 3 OOM score interface files:\n-rw-r--r-- 1 postgres postgres 0 May 24 16:39 /proc/63766/oom_adj -r--r--r-- 1 postgres postgres 0 May 24 16:39 /proc/63766/oom_score -rw-r--r-- 1 postgres postgres 0 May 24 16:39 /proc/63766/oom_score_adj oom_score is an OOM score dynamically calculated by the system, influenced at least by:\nMany child processes: +points Long-running: -points Low nice value: +points (nice value represents process CPU time slice priority. Lower nice values mean higher priority, more CPU time slice allocation) Direct hardware access: -points11 In addition to the Linux-calculated OOM score, adjustments (adj) can be manually applied. 
oom_adj is from earlier Linux kernel versions; it\u0026rsquo;s best to adjust OOM scores through the oom_score_adj interface file.\nParameter/File Purpose Example Values oom_score Kernel-calculated raw score (dynamic) 0~1000 oom_score_adj User-defined adjustment value, directly affects final score -1000~1000; -1000 equivalent to disabling OOM oom_adj (legacy) Legacy adjustment parameter, range -17~15 -17~15 lowmem_reserve_ratio # Besides min_free_kbytes, there\u0026rsquo;s another minimum memory reserve parameter that can cause process memory allocation failures, but their functions differ significantly.\nlowmem_reserve_ratio is a key kernel parameter used to protect low-end memory (DMA, DMA32) from being excessively consumed by high-end memory allocation requests. lowmem_reserve_ratio is just a coefficient, not a directly usable number; the kernel calculates the reserved page count for each zone.\n#Default values below cat /proc/sys/vm/lowmem_reserve_ratio 256 256 32 Memory zones are ordered by priority from low to high: DMA → DMA32 → Normal → HighMem. 
Allocation requests from higher-priority zones can \u0026ldquo;borrow\u0026rdquo; memory from lower-priority zones, but must reserve a certain proportion of memory for use by the lower-priority zones.\ncat /proc/zoneinfo |grep -Ew \u0026#34;Node 0|protection|free\u0026#34; Node 0, zone DMA pages free 3976 protection: (0, 2484, 386430, 386430) Node 0, zone DMA32 pages free 415741 protection: (0, 0, 383946, 383946) Node 0, zone Normal pages free 5658528 protection: (0, 0, 0, 0) For example, DMA\u0026rsquo;s protection indicates:\n0: Allocation from this zone, no cross-zone allocation restrictions 2484: Pages DMA reserves for DMA32 zone allocations 386430: Pages DMA reserves for Normal zone allocations 386430: Reserved extension field, meaningless in this context Based on these settings:\nWhen DMA32 zone requests memory from DMA zone, 3976 \u0026gt; 2484, it may succeed When Normal zone requests memory from DMA zone, 3976 \u0026lt; 386430, it will not succeed Requests from lower zones to higher zones are not subject to this restriction misc # A few more related parameters; those with less relevance are not listed:\nParameter Purpose nr_hugepages Number of huge pages nr_overcommit_hugepages Overcommit of huge pages; The maximum is nr_hugepages + nr_overcommit_hugepages nr_hugepages_mempolicy NUMA-localized huge page allocation hugetlb_shm_group Shared memory permission control hugetlb_optimize_vmemmap Restructure huge page metadata management model, reducing memory usage of huge page metadata (struct page). 
Supported since Linux kernel 5.13 max_map_count Limits the maximum number of memory mapping regions (VMA) a single process can have, default 65530 zone_reclaim_mode Memory reclamation policy under NUMA, e.g., allocating memory from other nodes stat_interval VM stat refresh frequency, default 1 second vfs_cache_pressure Parameter for VFS (Virtual File System) cache reclamation pressure, mainly affecting the aggressiveness of kernel reclaiming dentry and inode caches page-cluster Swap readahead, reads multiple consecutive pages from swap in a single attempt. Default 3, i.e., 8 pages at once OS Memory Observation and Calculation # /proc/meminfo, /proc/vmstat, /proc/zoneinfo all contain memory information, much of it duplicative. I won\u0026rsquo;t list the differences; a glance tells you what\u0026rsquo;s what.\nfree available Calculation (Unfinished) # General direction: (NR_FREE_PAGES + NR_FILE_PAGES - NR_SHMEM + NR_SWAP_PAGES + NR_SLAB_RECLAIMABLE - TOTALRESERVE_PAGES - root reserved memory)\nThe kernel has its own estimated available memory. Directly calculating the available value using a formula is difficult to get exactly right:\n## Not very accurate, don\u0026#39;t use cat /proc/meminfo |grep -Ew \u0026#34;MemFree|Active\\(file\\)|Inactive\\(file\\)|SwapFree|SReclaimable|nr_shmem|Shmem\u0026#34; |awk \u0026#39;NR==1 {a=$2} NR==2 {b=$2} NR==3 {c=$2} NR==4 {d=$2} NR==5 {e=$2} NR==6 {f=$2 ;print (a+b+c+d-e+f)}\u0026#39; ; cat /proc/meminfo |grep -Ew \u0026#34;MemAvailable\u0026#34;; inactive_anon + active_anon != anon # Why?\nPrimary: Shmem separately counts shared memory pages. 
nr_anon_pages does not include shared memory pages, while nr_inactive_anon and nr_active_anon include anonymous shared memory pages Secondary: anon includes some Unevictable pages (Mlocked is a subset of Unevictable) Other minor statistical differences have little impact A rough but relatively accurate formula: nr_inactive_anon + nr_active_anon + nr_unevictable - nr_shmem\n## Applicable under huge pages; not applicable under NUMA ## /proc/meminfo, /proc/zoneinfo, /proc/vmstat can all be used for calculation #/proc/vmstat echo -n \u0026#34;anon_computed : \u0026#34;;cat /proc/vmstat|egrep -w \u0026#34;nr_inactive_anon|nr_active_anon|nr_unevictable|nr_shmem\u0026#34;| awk \u0026#39;NR==1 {a=$2} NR==2 {b=$2} NR==3 {c=$2} NR==4 {d=$2; print (a+b+c-d)}\u0026#39; ;\\ echo -n \u0026#34;anon_real : \u0026#34;;cat /proc/vmstat|egrep -w \u0026#34;nr_anon_pages\u0026#34;|awk \u0026#39;{print $2}\u0026#39; anon_computed : 15776924 anon_real : 15772671 ##/proc/zoneinfo Normal echo -n \u0026#34;anon_normal_computed : \u0026#34;; cat /proc/zoneinfo |grep Normal -A 50|egrep -w \u0026#34;nr_inactive_anon|nr_active_anon|nr_unevictable|nr_shmem\u0026#34;| awk \u0026#39;NR==1 {a=$2} NR==2 {b=$2} NR==3 {c=$2} NR==4 {d=$2; print (a+b+c-d)}\u0026#39; ;\\ echo -n \u0026#34;anon_normal_real : \u0026#34;; cat /proc/zoneinfo |grep Normal -A 50|egrep -w \u0026#34;nr_anon_pages\u0026#34;|awk \u0026#39;{print $2}\u0026#39; anon_normal_computed : 15711170 anon_normal_real : 15707402 cache Calculation # The buff/cache shown in the free command can be calculated from file pages or cache itself:\necho -n \u0026#34;filepage+shmem: \u0026#34;;cat /proc/meminfo |grep -Ew \u0026#34;Buffers|Active\\(file\\)|Inactive\\(file\\)|Shmem|SReclaimable\u0026#34;| awk \u0026#39;NR==1 {a=$2} NR==2 {b=$2} NR==3 {c=$2} NR==4 {d=$2} NR==5 {e=$2 ;print (a+b+c+d+e)}\u0026#39;;\\ echo -n \u0026#34;cached: \u0026#34;;cat /proc/meminfo |grep -Ew \u0026#34;Buffers|Cached|SReclaimable\u0026#34; | awk \u0026#39;NR==1 {a=$2} 
NR==2 {b=$2} NR==3 {c=$2 ;print (a+b+c)}\u0026#39;;\\ free -k; #Execution results: filepage+shmem: 289417584 cached: 289419156 total used free shared buff/cache available Mem: 395721236 79633516 26668564 84704912 289419156 178501152 Swap: 5242876 0 5242876 Controversy: Does shmem Count as cache? # Clearly, the calculation above includes shmem in cache. Theoretically, shmem shouldn\u0026rsquo;t be part of cache.\nIn fact, the kernel community has discussed this in the thread \u0026ldquo;Why is Shmem included in Cached in /proc/meminfo?\u0026rdquo;, with a proposal to remove shared memory from cache:\n\u0026gt; -\tcached = global_node_page_state(NR_FILE_PAGES) - \u0026gt; -\ttotal_swapcache_pages() - i.bufferram; \u0026gt; +\tcached = global_node_page_state(NR_FILE_PAGES) - \u0026gt; +\ttotal_swapcache_pages() \u0026gt; +\t- i.bufferram - i.sharedram; But modifying this raises backward compatibility concerns, because existing tools already rely on the current meaning of Cached. The question comes down to: which is more important, backward compatibility or improving the accuracy of a piece of information?\nCurrently, there\u0026rsquo;s no good resolution; that\u0026rsquo;s the status quo.\nThe email thread also discusses some interesting things:\nAnother point of view is that everything in tmpfs is part of the page cache and can be written out to swap - Dirty: total amount of RAM used to buffer data to be written on permanent storage (dirty). Gets converted to Cached when write is complete. (Actually I would call this \u0026#34;Buffers\u0026#34; but Dirty is okay, too.) - Cached: total amount of RAM used to improve *performance* that can be *immediately dropped* without any data-loss – note that this includes all untouched RAM backed by swap. - Shared: total amount of RAM shared between multiple process that cannot be freed even if any single process gets killed. (If this is even possible to know - note that this would *only* contain COW pages in practice. We already have Committed_AS which is about as good for real world heuristics.) 
Under that definition, cache does not include dirty pages and can be dropped directly without data loss tmpfs pages can be swapped out Shared memory can likewise only be swapped out, which is clearly different from cache pages that can be directly dropped. PostgreSQL\u0026rsquo;s shared memory clearly cannot be directly dropped.\nSo for PostgreSQL, the fact that cache contains shared memory is quite important; don\u0026rsquo;t assume by default that it doesn\u0026rsquo;t.\nMemory Page Statistics Often Don\u0026rsquo;t Add Up # When calculating memory pages, some calculations don\u0026rsquo;t add up. Summary of reasons:\nshmem is counted in cache Cannot see file-mapped and anonymous-mapped pages within shmem nr_anon_pages does not include shared memory pages, while nr_inactive_anon and nr_active_anon include anonymous shared memory pages VM and cgroup have slightly different statistical scopes cgroup v1 # cgroup Memory Management # cgroup can observe and limit the usage of anonymous pages, file pages, swap cache, and kernel memory. Each memcg has its own independent LRU; there is no concept of a GLOBAL LRU.\ncgroup memory management differs from cgroup CPU management. A task that asks for more CPU than its cgroup limit allows simply takes longer to finish. Memory is different: the memory a task occupies is its working memory, and the task keeps holding those physical pages for as long as it uses them.\nKey differences between cgroup CPU and memory management:\nMemory must be managed through reuse and reclamation; a task\u0026rsquo;s working memory is truly occupied and cannot be used by other tasks. CPU is managed through time allocation; other tasks or cgroups can use it. Memory needs to be instantly available; CPU works through time slicing, so the time can be spread out. CPU control\u0026rsquo;s core is time allocation; Memory Control\u0026rsquo;s core is page counting. The core of the design is a counter called the page_counter. 
The page_counter tracks the current memory usage and limit of the group of processes associated with the controller\nMemory Control\u0026rsquo;s core is page counting, meaning it\u0026rsquo;s not that physical pages are statically assigned. The memory allocated this time, when released back to free after use, most likely won\u0026rsquo;t be the same physical page next time12.\nPhysical pages know which cgroup they belong to:\n+--------------------+ | mem_cgroup | | (page_counter) | +--------------------+ / ^ \\ / | \\ +---------------+ | +---------------+ | mm_struct | |.... | mm_struct | | | | | | +---------------+ | +---------------+ | + --------------+ | +---------------+ +------+--------+ | page +----------\u0026gt; page_cgroup| | | | | +---------------+ +---------------+ mm_struct represents virtual memory. Each virtual memory knows which cgroup it belongs to; each physical page can point to page_cgroup, meaning it knows which cgroup this physical memory belongs to12.\ncgroup Parameters and Metrics # cgroup uses interface files for configuration and viewing memory usage.\nDirectory: cd /sys/fs/cgroup/memory/xxx/\nKernel memory and mem+swap can have separate settings or usage viewing:\nmemory.kmem.xxx #kernel mem memory.memsw.xxx #mem+swap Below, we only look at mem-related items.\nInterface files can be divided into three categories:\nRead-only — show usage, permissions: -r--r--r-- Read-write — control parameters, permissions: -rw-r--r-- Other — special settings, permissions: other Specific meanings are as follows, with important parameters highlighted:\nType Interface File Meaning Read-only memory.numa_stat NUMA-dimensional memory stats Read-only memory.stat Important, the primary memory usage interface file with many metrics; analyzed separately below Read-only memory.usage_in_bytes usage_in_bytes is affected by the method and doesn\u0026rsquo;t show \u0026rsquo;exact\u0026rsquo; value of memory. 
Not recommended for viewing cgroup memory usage Read-only memory.failcnt Number of times memory usage exceeded memory.limit_in_bytes, cumulative Read-write cgroup.clone_children Controls whether child cgroups inherit parent configuration Read-write cgroup.procs Used to manage process groups (process IDs, PIDs) in the current cgroup. For multi-process PostgreSQL, this means writing all PG processes, including management processes and backends, into the procs file Read-write tasks Used to manage threads (thread IDs, TIDs) in the current cgroup. When writing a process PID to cgroup.procs, all its thread TIDs are automatically added to tasks Read-write notify_on_release Controls whether a release operation is triggered when the last task (process or thread) in the cgroup exits. Would only be enabled for container management; traditional cgroup management keeps it disabled by default. Cgroups should be preserved after database restart Read-write memory.move_charge_at_immigrate Deprecated in v2. Charge attribution rules when migrating cgroups Read-write memory.use_hierarchy Whether parent cgroup limits child cgroups Read-write memory.limit_in_bytes cgroup memory upper limit Read-write memory.soft_limit_in_bytes Reclaim the portion exceeding the soft limit Read-write memory.max_usage_in_bytes cgroup usage peak, an observation metric Read-write memory.oom_control oom_kill_disable 1 — disable OOM\nunder_oom 0 — whether currently in OOM state Read-write memory.swappiness cgroup-level swappiness Other memory.force_empty Write only; writing 0 forces release of all cgroup memory Other cgroup.event_control Event notification interface, listens for memory pressure events, requires programming. 
Often used with memory.pressure_level Other memory.pressure_level Memory pressure notification level Using a PG instance to explain the meaning of various metrics in memory.stat.\nThis PG instance is configured as:\nshared_memory_type=mmap shared_buffers=64GB approximately 800 clients, running cat memory.stat cache 345587761152 #page cache!!! rss 27332608 #Anonymous and swap cache memory size. Note: differs from OS process RSS; clearly doesn\u0026#39;t include PG shared memory rss_huge 0 #of bytes of anonymous transparent hugepages. Note: anonymous huge pages mapped_file 61491769344 #File shared memory size; includes PG shared memory here swap 0 #On swap partition pgpgin 389395357 #rss+cache charged pages pgpgout 305016672 #rss+cache uncharged pages pgfault 1954040341 #Omitted pgmajfault 17 #Omitted inactive_anon 165728256 #anonymous and swap cache memory on inactive LRU active_anon 61549518848 #anonymous and swap cache memory on active LRU list inactive_file 138240962560 #file-backed on inactive LRU list active_file 145658613760 #file-backed memory on active LRU list unevictable 0 #Unreclaimable memory hierarchical_memory_limit 408021893120 # hierarchical_memsw_limit 9223372036854771712 # total_xxx #hierarchical Roughly (ignoring swap), cache+rss = inactive_anon+active_anon+inactive_file+active_file.\nThese values are quite convoluted. cache+rss doesn\u0026rsquo;t have a straightforward correspondence with [in]active_anon/file, and mapped_file (shared memory) is hard to categorize, making it easy to get confused. 
Combining various documentation and testing, I hand-rolled the following script:\n#cginfo_lzl echo -n \u0026#34;shared_mem_mapped : \u0026#34;;cat /sys/fs/cgroup/memory/$PGNAME/memory.stat|egrep -w \u0026#34;mapped_file\u0026#34;| awk \u0026#39;{print $2 / 1024 / 1024 /1024 }\u0026#39; ;\\ echo -n \u0026#34;shared_mem_anon : \u0026#34;;cat /sys/fs/cgroup/memory/$PGNAME/memory.stat|egrep -w \u0026#34;rss|inactive_anon|active_anon\u0026#34;| awk \u0026#39;NR==1 {a=$2} NR==2 {b=$2} NR==3 {c=$2; print (b + c -a)/1024/1024/1024}\u0026#39; ;\\ echo -n \u0026#34;pagecache : \u0026#34;;cat /sys/fs/cgroup/memory/$PGNAME/memory.stat|egrep -w \u0026#34;cache\u0026#34;| awk \u0026#39;{print $2 / 1024 / 1024 /1024 }\u0026#39; ;\\ echo -n \u0026#34;pagecache_cache-share : \u0026#34;;cat /sys/fs/cgroup/memory/$PGNAME/memory.stat|egrep -w \u0026#34;cache|mapped_file\u0026#34;| awk \u0026#39;NR==1 {a=$2} NR==2 {b=$2; print (a - b)/1024/1024/1024}\u0026#39;;\\\\n echo -n \u0026#34;file_total : \u0026#34;;cat /sys/fs/cgroup/memory/$PGNAME/memory.stat|egrep -w \u0026#34;inactive_file|active_file\u0026#34;| awk \u0026#39;{sum += $2} END {print sum /1024/1024/1024}\u0026#39;;\\\\ echo -n \u0026#34;anon_total : \u0026#34;;cat /sys/fs/cgroup/memory/$PGNAME/memory.stat|egrep -w \u0026#34;inactive_anon|active_anon\u0026#34;| awk \u0026#39;{sum += $2} END {print sum /1024/1024/1024}\u0026#39;;\\\\ echo -n \u0026#34;total_used_rss+map : \u0026#34;;cat /sys/fs/cgroup/memory/$PGNAME/memory.stat|egrep -w \u0026#34;rss|mapped_file\u0026#34;| awk \u0026#39;{sum += $2} END {print sum /1024/1024/1024}\u0026#39;;\\\\ echo -n \u0026#34;total_mem_file+rss+map : \u0026#34;;cat /sys/fs/cgroup/memory/$PGNAME/memory.stat|egrep -w \u0026#34;inactive_file|active_file|rss|mapped_file\u0026#34;| awk \u0026#39;{sum += $2} END {print sum /1024/1024/1024}\u0026#39;;\\\\ echo -n \u0026#34;total_mem_rss+cache : \u0026#34;;cat /sys/fs/cgroup/memory/$PGNAME/memory.stat|egrep -w \u0026#34;rss|cache\u0026#34;| awk 
\u0026#39;{sum += $2} END {print sum /1024/1024/1024}\u0026#39;;\\\\ echo -n \u0026#34;total_mem_anon+file : \u0026#34;;cat /sys/fs/cgroup/memory/$PGNAME/memory.stat|egrep -w \u0026#34;inactive_file|active_file|inactive_anon|active_anon\u0026#34;| awk \u0026#39;{sum += $2} END {print sum /1024/1024/1024}\u0026#39;;\\\\ echo -n \u0026#34;total_memsw : \u0026#34;;cat /sys/fs/cgroup/memory/$PGNAME/memory.stat|egrep -w \u0026#34;rss|cache|swap\u0026#34;| awk \u0026#39;{sum += $2} END {print sum /1024/1024/1024}\u0026#39;;\\\\ echo -n \u0026#34;hard_limit : \u0026#34;;cat /sys/fs/cgroup/memory/$PGNAME/memory.limit_in_bytes| awk \u0026#39;{print $1 / 1024 / 1024 /1024 }\u0026#39; #Database with shared_buffers=2GB shared_mem_mapped : 1.69063 shared_mem_anon : 1.69828 pagecache : 5.94717 pagecache_cache-share : 4.25654 file_cache : 4.24889 anon_cache : 3.23096 total_used_rss+map : 3.2233 total_mem_file+rss+map : 7.47219 total_mem_rss+cache : 7.47984 total_mem_anon+file : 7.47984 total_memsw : 7.47984 hard_limit : 8 Differences Between cgroup RSS and Process RSS # #shared_buffers= 64GB, all PG process RSS sorted ps -eo pid,ppid,rss,args |grep `cat $PGDATA/postmaster.pid|head -1`|sort -k3 -rn 97632 97627 61103720 postgres: lzlinst: checkpointer 97633 97627 59045152 postgres: lzlinst: background writer 97627 1 2322820 /paic/postgres/base/11.3/bin/postgres -D /paic/pg6888/data 97637 97627 85116 postgres: lzlinst: pgsentinel 97697 97627 19620 postgres: lzlinst: dbmgr users [local] idle 97634 97627 17932 postgres: lzlinst: walwriter 250063 97627 14508 postgres: lzlinst: dbmon postgres [local] idle 97636 97627 13220 postgres: lzlinst: stats collector 248777 97627 11576 postgres: lzlinst: dbmon postgres [local] idle 97635 97627 2980 postgres: lzlinst: autovacuum launcher 97638 97627 2376 postgres: lzlinst: logical replication launcher 97630 97627 1592 postgres: lzlinst: logger 250185 39130 972 grep --color=auto 97627 Generally, the PG processes with the highest RSS values are 
checkpointer and bgwriter, because RSS represents actual memory used, including shared memory, and these two processes that flush shared buffer dirty pages occupy the most. Backends with excessive data queries may also have higher RSS values, but this is usually caused by data extracts or slow full-scan queries.\nWhy is postmaster\u0026rsquo;s RSS so small? Because postmaster itself doesn\u0026rsquo;t need to do much shared_buffer operations; it only needs to open up the shared memory virtual address space and fork it for other processes to use.\nPM\u0026rsquo;s child processes have the same shared memory address but not necessarily the same RSS:\n$ cat /proc/97632/smaps |grep -A 3 \u0026#34;zero\u0026#34; #checkpointer 2b4fd87cf000-2b60a2143000 rw-s 00000000 00:04 15925397 /dev/zero (deleted) Size: 70411728 kB Rss: 61087812 kB Pss: 31429895 kB $ cat /proc/97633/smaps |grep -A 3 \u0026#34;zero\u0026#34; #bgwriter 2b4fd87cf000-2b60a2143000 rw-s 00000000 00:04 15925397 /dev/zero (deleted) Size: 70411728 kB Rss: 59043388 kB Pss: 29394787 kB $ cat /proc/97627/smaps |grep -A 3 \u0026#34;zero\u0026#34; #postmaster 2b4fd87cf000-2b60a2143000 rw-s 00000000 00:04 15925397 /dev/zero (deleted) Size: 70411728 kB Rss: 2318408 kB Pss: 1741764 kB Above, checkpointer and bgwriter occupy the most RSS, and most of their RSS is shared memory. These two processes almost evenly split the entire actually-used shared memory, while postmaster doesn\u0026rsquo;t use much. PM and all its forked child processes have the same shared memory virtual address.\nBut cgroup RSS is only a few tens of MB, far less than process RSS:\ncat /sys/fs/cgroup/memory/lzlinst/memory.stat |egrep -w \u0026#34;rss|mapped_file\u0026#34; rss 88997888 mapped_file 52963262464 You can see that PG shared memory is not in the cgroup stat RSS. cgroup RSS doesn\u0026rsquo;t count file pages or shared file pages.\nlinux kernel12:\nOnly anonymous and swap cache memory is listed as part of \u0026lsquo;rss\u0026rsquo; stat. 
This should not be confused with the true \u0026lsquo;resident set size\u0026rsquo; or the amount of physical memory used by the cgroup.\nProcess vs. cgroup memory statistics differences13:\nMemory Single Process Process cgroup(memcg) cache None PageCache mapped_file None file_rss + shmem_rss RSS anon_rss + file_rss ＋ shmem_rss anon_rss For PostgreSQL, the RSS in stat does not include file map shared memory. The PG official documentation describes mmap as anonymous shared memory:\nPossible values are mmap (for anonymous shared memory allocated using mmap), sysv (for System V shared memory allocated via shmget)\ncgroup counts PG mmap memory as mapped_file.\nObserving sysv and huge page scenarios, summary of PG\u0026rsquo;s memory.stat metrics:\nRSS in stat does not include file map shared memory. Observation shows that regardless of mmap or sysv, RSS does not contain PG shared memory Similarly, rss_huge also does not include file map shared huge page memory. Observation shows that even with huge pages enabled, stat does not contain PG shared memory Without huge pages, PG shared memory (mmap or sysv) is all counted under memory.stat mapped_file; with huge pages, it\u0026rsquo;s in none of the stat metrics, including rss_huge Where Exactly Is mapped_file? # mapped_file is in cache, and also in inactive_anon+active_anon mapped_file can also be anonymous; both mmap and sysv are counted here #Database with shared_buffers=2GB shared_mem_mapped : 1.69063 shared_mem_anon : 1.69828 pagecache : 5.94717 pagecache_cache-share : 4.25654 file_cache : 4.24889 anon_cache : 3.23096 total_used_rss+map : 3.2233 total_mem_file+rss+map : 7.47219 total_mem_rss+cache : 7.47984 total_mem_anon+file : 7.47984 total_memsw : 7.47984 hard_limit : 8 soft_limit_in_bytes # Soft limit (memory.soft_limit_in_bytes) is a non-enforced constraint in cgroup memory management. When a cgroup\u0026rsquo;s memory usage exceeds the soft limit, the system does not immediately force memory reclamation. 
Instead, it will preferentially reclaim the excess memory of that cgroup when global memory pressure is high (e.g., when overall system free memory is insufficient).\nTrigger condition: Global memory pressure (e.g., insufficient system free memory). Call path: kswapd → balance_pgdat → check cgroup soft limits → trigger reclamation. Reclamation target: Preferentially reclaim memory pages from cgroups exceeding their soft limits. +-------------------+ Global memory pressure detection +-------------------+ | kswapd thread | ------------------------------------\u0026gt; | balance_pgdat | +-------------------+ +-------------------+ | | Traverse memory zones and check v +---------------------------+ | Check each cgroup\u0026#39;s soft | | limit usage | +---------------------------+ | | Trigger reclamation for over-limit cgroups v +---------------------------+ | Page reclamation (LRU list | | scanning, etc.) | +---------------------------+ The soft_limit_in_bytes mechanism is very similar to high. In v2, soft_limit_in_bytes has been deprecated, replaced by three new parameters: min, low, and high.\nImpact of Overselling on pagecache # To be discussed later\ncg oom # Normally, if sharedbuffer = 1/4 of cg mem, then without counting private memory, pagecache can reach up to 3/4 of cg mem. Generally, normal business private memory usage won\u0026rsquo;t be very high. If cg mem is full, memory can be reclaimed from cg pagecache (this is direct memory reclamation; AliOS has implemented async background reclamation: Memcg Background Async Reclamation). 
So the best way to test cg oom is to use sessions that consume lots of private memory rather than stress testing.\nTest case:\n#Observe score -r--r--r-- 1 postgres postgres 0 May 24 16:39 /proc/63766/oom_score rss # whichever command you like ## A SQL that can consume lots of private memory, many union alls create many plan nodes psql -d lzldb -tX -c \u0026#34;create table lzl1(col1 varchar(1));\u0026#34; psql -tX -c \u0026#34;\\o sqltext.sql\u0026#34; -c \u0026#34; SELECT \u0026#39;select col1 from lzl1\u0026#39; || \u0026#39; union all\u0026#39; FROM generate_series(1, 100000) UNION ALL SELECT \u0026#39;select col1 from lzl1;\u0026#39; FROM generate_series(1, 1); \u0026#34; #Adjust stack parameter otherwise SQL will be aborted psql -d lzldb -c \u0026#34;set max_stack_depth=1024000\u0026#34; -f sqltext.sql cg oom off:\nwchan shows OOM information, even an oom score, but the process won\u0026rsquo;t be killed by the OOM killer\n## vm oom enabled; 0: don\u0026#39;t trigger panic, start OOM Killer $ cat /proc/sys/vm/panic_on_oom 0 ## cg oom disabled; 1: disable oom $ cat /sys/fs/cgroup/memory/$PGNAME/memory.oom_control oom_kill_disable 1 under_oom 0 $ ps -eo user,ppid,pid,state,%cpu,%mem,stime,wchan:14,args,rss,vsz,sig_block |grep `head -1 $PGDATA/postmaster.pid` |grep -v grep postgres 19005 870 D 0.0 0.0 10:54 mem_cgroup_oom postgres: pg3ymhp2: lzluser 7216 2807460 0000000000000000 postgres 19005 3417 S 0.0 0.0 10:55 pipe_wait postgres: pg3ymhp2: lzluser 22944 2808540 0000000000000000 postgres 19005 13069 D 0.0 0.0 11:10 mem_cgroup_oom postgres: pg3ymhp2: lzluser 11944 2808348 0000000000000000 postgres 19005 13104 D 0.0 0.0 11:10 mem_cgroup_oom postgres: pg3ymhp2: lzluser 12224 2808348 0000000000000000 postgres 19005 14352 D 0.0 0.0 11:10 mem_cgroup_oom postgres: pg3ymhp2: lzluser 11680 2808348 0000000000000000 cat /sys/fs/cgroup/memory/$PGNAME/memory.oom_control oom_kill_disable 1 under_oom 1 cat /proc/97994/oom_score 11 shared_mem_mapped : 2.00019 shared_mem_anon 
: 2.0023 pagecache : 2.0023 pagecache_cache-share : 0.00211334 file_cache : 0 anon_cache : 8 total_used_rss+map : 7.99789 total_mem_file+rss+map : 7.99789 total_mem_rss+cache : 8 total_mem_anon+file : 8 total_memsw : 8 hard_limit : 8 Currently, it appears that PG processes may also crash when unable to allocate memory. For example, if walwriter crashes, it can cause all other processes to crash.\ncg oom on:\nUser processes are killed due to high OOM score, sent kill -9. Most PG processes crash; postmaster reset_shared() then automatically restarts other processes. Both message and dmesg show out-of-memory information:\n#cg oom enabled oom_kill_disable 0 pg log: 2025-05-29 19:10:45.945 CST,,,198877,,6838374d.308dd,4,,2025-05-29 18:30:37 CST,,0,LOG,00000,\u0026#34;server process (PID 236413) was terminated by signal 9: Killed\u0026#34;,\u0026#34;Failed process was running: select col1 from lzl1 union all message: May 29 19:10:45 lzlhost kernel: Memory cgroup stats for /t1lzldb: cache:8392988KB rss:8384228KB rss_huge:0KB mapped_file:7458316KB swap:0KB inactive_anon:1310184KB active_anon:15467032KB inactive_file:0KB active_file:0KB unevictable:0KB May 29 19:10:45 lzlhost kernel: Memory cgroup out of memory: Kill process 236413 (postgres) score 497 or sacrifice child dmesg: [Thu May 29 18:26:27 2025] Memory cgroup stats for /t1lzldb: cache:8392988KB rss:8384228KB rss_huge:0KB mapped_file:7458316KB swap:0KB inactive_anon:1310184KB active_anon:15467032KB inactive_file:0KB active_file:0KB unevictable:0KB [Thu May 29 18:26:27 2025] Memory cgroup out of memory: Kill process 236413 (postgres) score 497 or sacrifice child [Thu May 29 18:26:27 2025] Killed process 236413 (postgres) total-vm:18828736kB, anon-rss:8328252kB, file-rss:2328kB, shmem-rss:1832kB Management differences between cg oom on and off for PG databases:\non: cg oom killer will kill processes with high OOM score, typically user processes off: cg oom killer won\u0026rsquo;t start. 
PG processes will hang; they may recover on their own, but PG\u0026rsquo;s critical processes (like walwriter) might crash due to insufficient memory, and the instance may still go down. Note: this is cg oom, not vm oom. System-level vm oom is determined by the system-level vm overcommit mechanism.\ncg v1 Problems # No cg pagetable statistics No cg slab statistics No cg hugepage statistics (hugepages are not charged, not just not counted) No cg async/sync page reclamation statistics cg RSS and process RSS have different statistical scopes shmem statistics are messy What\u0026rsquo;s New in V2 # v2 was officially released in Linux 4.5 (March 2016)14.\ncgroup v2 memory management improvements and changes:15\ncg mem interface file vs v1 Meaning memory.current Reworked Current memory usage. Removes the less useful usage_in_bytes memory.min New Different from VM\u0026rsquo;s min/low/high. VM watermarks are about remaining OS memory; cg v2 watermarks are about cg memory used. memory.min is a hard memory protection value, default 0. Even when the system has no reclaimable memory, memory at or below this boundary won\u0026rsquo;t be reclaimed16 memory.low New Best-effort memory protection value, default 0. System preferentially reclaims memory from unprotected cgroups. If still insufficient, reclaims memory between memory.min and memory.low. memory.high New Memory reclamation warning threshold, default max. When cgroup memory usage reaches high, triggers synchronous memory reclamation for this cgroup and children, trying to keep memory below high memory.max Reworked Equivalent to memory.limit_in_bytes; this is the hard limit, and usage that reaches it and cannot be reclaimed triggers the cg oom killer memory.reclaim Reworked Active reclamation interface file. v1 only had memory.force_empty for forced clearing memory.peak Reworked Equivalent to max_usage_in_bytes; a recorded high-water mark of usage that does not itself trigger anything memory.oom.group New Controls whether cg OOM killer terminates the entire cgroup (1) or just a single process (0). Default 0. 
If oom_score_adj=-1000, process won\u0026rsquo;t be killed memory.events New Reports memory-related events memory.stat Reworked Many changes, analyzed separately memory.zswap.current, memory.zswap.max, memory.zswap.writeback New Zswap is a compressed swap mechanism in the Linux kernel. Through compressing memory pages awaiting swap, it reduces disk I/O operations, improving system performance. Its core idea is to compress swap data that would have been written to disk and temporarily store it in memory, only writing data to physical swap devices (like swap partitions or files) when necessary soft_limit_in_bytes Removed memory.oom_control Removed This means v2 cannot directly disable cg oom killer; however, fine-grained memory management can be achieved through min/low/high settings and event memory notifications v2 cg mem management advantages:\nCompared to v1, v2 has simpler and clearer hierarchical management v1 only had OOM kill or freeze; v2 has more means to control memory size (such as memory.min/low/high) v2 makes it easier to control burst loads17 Removes the interface file for directly disabling cg oom killer Adds memory_hugetlb_accounting memory.stat:\nParameter Meaning v1 Counterpart anon Anonymous pages active_anon+inactive_anon file File pages, including tmpfs active_file+inactive_file kernel (npn) Total kernel memory, including kernel_stack, pagetables, percpu, vmalloc, slab, and other kernel memory usage. New kernel_stack Memory occupied by kernel stacks. 
New pagetables page tables New sec_pagetables Secondary page tables, suitable for VMs, GPU devices, network acceleration cards, and other hardware resource isolation scenarios New percpu (npn) Memory size used for per-cpu kernel data structures New sock (npn) network transmission buffers New vmalloc (npn) vmalloc New shmem Including tmpfs, shm, shared anonymous mmap New zswap Memory consumed by zswap compression itself New zswapped Amount of user memory zswapped New file_mapped mmap() size Somewhat similar to v1 mapped_file, though mapped_file includes tmpfs, shm file_dirty Same as v1 dirty file_writeback Same as v1 writeback swapcached Same as v1 swapcached anon_thp Anonymous pages in transparent huge pages New file_thp File pages in transparent huge pages New shmem_thp Transparent huge pages for shm, tmpfs, anonymous mmap New inactive_anon, active_anon, inactive_file, active_file, unevictable Same as v1 slab_reclaimable As the name suggests New slab_unreclaimable As the name suggests New slab (npn) As the name suggests New workingset_refault_anon, workingset_refault_file, workingset_activate_anon, workingset_activate_file, workingset_restore_anon, workingset_restore_file, workingset_nodereclaim Refaulted page statistics New pswpin (npn) swap in Same as v1 pgpgin pswpout (npn) swap out Same as v1 pgpgout pgscan (npn) scanned pages (in an inactive LRU list) New pgsteal (npn) Reclaimed memory New pgscan_kswapd (npn) As the name suggests New pgscan_direct (npn) As the name suggests New pgscan_khugepaged (npn) Pages scanned by the transparent huge page daemon New pgscan_proactive (npn) Pages scanned proactively New pgsteal_kswapd (npn), pgsteal_direct (npn), pgsteal_khugepaged (npn), pgsteal_proactive (npn) As the name suggests; pgsteal\\* corresponds to pgscan\\* New pgfault (npn) As the name suggests Same as v1 pgfault pgmajfault (npn) As the name suggests Same as v1 pgmajfault pgrefill (npn) Pages scanned in active LRU New pgactivate (npn) Pages moved to active LRU 
New pgdeactivate (npn) Pages moved to inactive LRU New pglazyfree (npn) Pages whose release is deferred when under memory pressure New pglazyfreed (npn) Reclaimed lazyfree pages New swpin_zero,swpout_zero Zero-filled pages; when the kernel detects at swap-out that a page\u0026rsquo;s content is all zeros, it marks the page as a \u0026ldquo;zero page\u0026rdquo; in metadata and skips disk I/O on both the swap-out and the later swap-in New zswpin,zswpout,zswpwb zswap-related pages New thp_fault_alloc (npn), thp_collapse_alloc (npn), thp_swpout (npn), thp_swpout_fallback (npn) Transparent huge page-related pages New numa_pages_migrated (npn), numa_pte_updates (npn), numa_hint_faults (npn) NUMA-related pages; also memory.numa_stat exists New pgdemote_kswapd, pgdemote_direct, pgdemote_khugepaged, pgdemote_proactive Pages demoted, i.e. migrated to a slower memory tier instead of being reclaimed, counted per trigger (kswapd, direct reclaim, khugepaged, proactive reclaim) New hugetlb Huge pages New v2 cg mem observation advantages:\nAdds slab, pagetable, pgscank/pgscand/pgsteal, and huge page info, none of which v1 had More observation metrics related to specific features, such as sock, vmalloc, transparent huge pages, zswap compression interactions, swap_zero zero-fill interactions, etc. Shared memory shmem and file_mapped metrics are separated wchan # Waiting Channel, name of the kernel function in which the process is sleeping\nGenerally, you should check the wchan of processes in D state to see what kernel function the process is waiting on.\n-: Running tasks will display a dash (\u0026rsquo;-\u0026rsquo;) in this column\npoll_schedule_timeout: Common for the postmaster (PM), usually in running state\nzz ***Fri May 2 04:50:10 CST 2025 postgres 141378 1 19 0.5 0.4 70585180 2322876 poll_schedule_timeout S 21:06:18 00:02:40 /paic/postgres/base/11.3/bin/postgres -D /paic/pg6888/data zzz ***Fri May 2 04:50:43 CST 2025 postgres 141378 1 19 0.5 0.4 70585180 2322876 - R 21:06:18 00:02:42 /paic/postgres/base/11.3/bin/postgres -D /paic/pg6888/data futex_wait_queue_me: Common for SLEEP processes. 
Occasionally D state\npostgres 455358 141378 19 4.7 1.0 70590684 5349576 futex_wait_queue_me S 03:01:12 00:02:47 postgres: t1lzldb: lzl test3 30.181.32.3(39801) COMMIT hugetlb_fault: Only seen when huge pages are touched for the first time, just as the load starts up\ndo_last: Function in the VFS (Virtual File System) path resolution logic, responsible for handling the last component of a file path (such as filename or symbolic link) and triggering actual file operations\nlock_page_killable: Lock a physical memory page while sleeping in a killable state. \u0026ldquo;Killable\u0026rdquo; means the wait can be interrupted only by fatal signals like SIGKILL, not by ordinary signals\nrpc_wait_bit_killable: This function relates to the Remote Procedure Call (RPC) mechanism, used in the kernel to wait for changes to certain bit flags\nwait_on_page_bit: Wait for changes to page flag states (e.g., PG_locked, PG_writeback)\nblkdev_issue_flush: Block device layer cache flush function. Possible call chain: user calls fsync() → file system (e.g., ext4) submits relevant dirty pages to the block device layer → calls blkdev_issue_flush() to ensure device cache is flushed\non_proc_exit: Register cleanup functions for process exit\nima_file_check: Belongs to the IMA (Integrity Measurement Architecture) subsystem, used to verify file integrity during file access; typically involved with open() calls\nflush_work: Wait for a queued work item to finish executing\ncall_rwsem_down_write_failed: When attempting to acquire a write lock (down_write()) fails, this function handles write lock contention and waiting logic. It uses spin or sleep mechanisms to make the current process wait for lock release (rwsem: read-write semaphore)\nget_request: Appears when iowait is high. Gets a free request structure (struct request) from the block device request queue. 
If the queue is full (device processing speed insufficient), the thread waits until a request is available\nlookup_slow: Slow path for VFS (Virtual File System) path resolution\n/** * lookup_fast - do fast lockless (but racy) lookup of a dentry * @nd: current nameidata * * Do a fast, but racy lookup in the dcache for the given dentry, and * revalidate it. Returns a valid dentry pointer or NULL if one wasn\u0026#39;t * found. On error, an ERR_PTR will be returned. */ static struct dentry *lookup_fast(struct nameidata *nd) /* Fast lookup failed, do it the slow way */ static struct dentry *__lookup_slow(const struct qstr *name, struct dentry *dir, unsigned int flags) static struct dentry *lookup_slow(const struct qstr *name, struct dentry *dir, unsigned int flags) { struct inode *inode = dir-\u0026gt;d_inode; struct dentry *res; inode_lock_shared(inode); res = __lookup_slow(name, dir, flags); inode_unlock_shared(inode); return res; } lookup_fast and lookup_slow both search for dentries and return them. lookup_fast searches in the dentry cache; if it fails, lookup_slow is used.\nStress testing with huge pages enabled, no direct memory reclamation, the following events occurred:\nlock_page: Appears when iowait is high. When the kernel attempts to lock a memory page, if the page is already locked by another thread/process, the current thread enters a waiting state.\nvx_svar_sleep_unlock, vx_ilock, vx_bc_biowait, vx_dio_physio, vx_rwsleep_lock:\nvx is a journaling file system developed by Veritas (now owned by Symantec and subsequently spun off as Veritas Technologies), designed for high-performance, high-availability large-scale data storage, primarily targeting enterprise application scenarios. 
Like xfs and ext4, it is a type of file system.\npipe_wait: When a process attempts to read from or write to a pipe, if the pipe buffer is full (write operation) or empty (read operation), the current thread enters sleep state, waiting for buffer state changes\npipe_write: Entry function for pipe write operations. When the buffer is full, the thread sleeps in this function, waiting for writable space\ncongestion_wait: When the block device I/O queue is congested (e.g., request queue full or device processing delayed), the kernel uses this function to briefly sleep the thread\nwait_iff_congested: Checks whether the block device queue is congested and enters brief sleep if so. Similar to congestion_wait but more lightweight, typically used in memory reclamation or dirty page writeback paths\nmem_cgroup_oom_synchronize: When usage_in_bytes reaches limit_in_bytes, marks oom_control.under_oom=1. Whether the OOM killer kernel module is activated depends on oom_control.oom_kill_disable\nmem_cgroup_oom: Same as mem_cgroup_oom_synchronize\nrmap_walk # One of PFRA\u0026rsquo;s goals is to reclaim shared page frames. 
To achieve this, the Linux 2.6 kernel can quickly locate all page table entries pointing to the same page frame — this process is called reverse mapping[^ 《深入理解Linux内核》 (Understanding the Linux Kernel)].\nWhen a page frame already referenced by one process is inserted into another process\u0026rsquo;s page table entries (fork), rmap_walk should also occur\nzcat hostlzl_ps_25.04.08.0900.dat.gz|egrep \u0026#34;\\-D /dirlzl/pg5998/data|zzz\u0026#34;|less zzz ***Tue Apr 8 09:10:50 CST 2025 postgres 209987 1 19 0.2 0.5 70247548 2117844 poll_schedule_timeout S 22:17:21 00:01:56 /dirlzl/postgres/base/postgressql/bin/postgresdb -D /dirlzl/pg5998/data zzz ***Tue Apr 8 09:11:20 CST 2025 postgres 209987 1 19 0.2 0.5 70247548 2117844 poll_schedule_timeout S 22:17:21 00:01:56 /dirlzl/postgres/base/postgressql/bin/postgresdb -D /dirlzl/pg5998/data zzz ***Tue Apr 8 09:13:08 CST 2025 postgres 209987 1 19 0.2 0.5 70247548 2117844 - D 22:17:21 00:01:57 /dirlzl/postgres/base/postgressql/bin/postgresdb -D /dirlzl/pg5998/data postgres 225076 209987 19 1.6 0.0 70247548 1720 rmap_walk D 09:11:51 00:00:01 /dirlzl/postgres/base/postgressql/bin/postgresdb -D /dirlzl/pg5998/data postgres 224924 209987 19 0.7 0.0 70247548 1728 rmap_walk D 09:11:46 00:00:00 /dirlzl/postgres/base/postgressql/bin/postgresdb -D /dirlzl/pg5998/data postgres 224817 209987 19 0.5 0.0 70247548 1720 try_to_unmap_file D 09:11:44 00:00:00 /dirlzl/postgres/base/postgressql/bin/postgresdb -D /dirlzl/pg5998/data zzz ***Tue Apr 8 09:19:16 CST 2025 postgres 209987 1 19 0.3 0.5 70247548 2117884 poll_schedule_timeout S 22:17:21 00:02:00 /dirlzl/postgres/base/postgressql/bin/postgresdb -D /dirlzl/pg5998/data postgres 250875 209987 19 0.0 0.0 70247548 2208 - R 09:19:17 00:00:00 /dirlzl/postgres/base/postgressqlbin/postgresdb -D /dirlzl/pg5998/data zzz ***Tue Apr 8 09:19:48 CST 2025 postgres 209987 1 19 0.3 0.5 70247548 2117884 poll_schedule_timeout S 22:17:21 00:02:01 /dirlzl/postgres/base/postgressql/bin/postgresdb -D 
/dirlzl/pg5998/data try_to_unmap_file # The try_to_unmap_file() function calls try_to_unmap_cluster(), and try_to_unmap_cluster() scans all page table entries corresponding to linear addresses in that linear region, attempting to clear them[^ 《深入理解Linux内核》 (Understanding the Linux Kernel)]. try_to_unmap_file() performs reverse mapping of mapped pages. Note: reverse mapping means starting from a physical page frame and finding every VMA and page table entry that maps it, so that shared page frames can be unmapped and reclaimed.\npage_referenced # referenced and active are used to control page activity level and are used in page reclamation. When refcount=0, it indicates free pages or pages about to be released[^《奔跑吧 Linux内核 入门篇（第2版）》 (Running Linux Kernel: Beginner\u0026rsquo;s Guide 2nd Edition)].\nIn kernel.org doc\u0026rsquo;s Object-Based Reverse Mapping, there is a description of the page_referenced() function3:\npage_referenced() which checks all PTEs that map a page to see if the page has been referenced recently\npage_referenced() calls page_referenced_obj() which is the top level function for finding all PTEs within VMAs that map the page.\nIf a page is mapped and it is referenced through the mapping, index hash table, this bit is set. It is used during page replacement for moving the page around the LRU lists\nIn short, page_referenced() starts from the page frame and finds the PTEs, across all VMAs, that map the page. This is also a reverse mapping process.\nLinux introduced two page flags, PG_active and PG_referenced, to identify the activity level of pages, thereby deciding how to move pages between two lists (active LRU and inactive LRU).\nPG_active is used to indicate whether the page is currently active; if this bit is set, the page is active. 
PG_referenced is used to indicate whether the page has been accessed recently; each time the page is accessed, this bit is set.\npage_referenced(): When the operating system performs page reclamation, each time a page is scanned, this function is called to set the page\u0026rsquo;s PG_referenced bit. If a page\u0026rsquo;s PG_referenced bit is set but the page is not accessed again within a certain time, its PG_referenced bit will be cleared18.\nMemory Observation Metrics # View basic memory settings:\nObserve memory metrics:\nSome Questions # Do kswapd and Direct Memory Reclamation Execute Together? # Yes. If it\u0026rsquo;s watermark-triggered memory reclamation, pgscand is often accompanied by pgscank; the reverse is not necessarily true. If both pgscank and pgscand are frequent, consider adjusting memory reclamation watermarks, widening the gap between them so that kswapd has more headroom before direct reclamation kicks in.\nHowever, there\u0026rsquo;s another case: when fragmentation rate is high and free memory is still plentiful, blocking memory compaction may be directly triggered with pgscand but no pgscank at all. In this case, adjusting watermarks won\u0026rsquo;t help. Consider enabling huge page memory and increasing shared buffer hit rate to reduce frequent pagecache allocation that fragments memory.\nImpact of Oversized pagetable on Memory Reclamation # An oversized pagetable increases the cost and time of reverse mapping. During direct memory reclamation, reverse mapping is needed to find all processes\u0026rsquo; virtual address spaces (VMAs) that map a page, then remove that page\u0026rsquo;s mapping from each process\u0026rsquo;s page table. 
This means: the more processes, the larger the pagetable, and the slower the memory reclamation.\nThe more PostgreSQL processes, the larger the pagetable; the larger the shared buffers, the larger the pagetable.\nEnabling huge page memory can reduce pagetable size by roughly 512x (4 KB =\u0026gt; 2 MB pages), not only freeing up memory but also improving memory reclamation efficiency.\nHow Large Should shared buffers Be? # sharedbuffers = 1/4 cgmem seems to have become an industry standard, but the actual situation is far more complex. Theoretically, reducing sharedbuffers a bit can increase pagecache a bit, actually slightly increasing total cache size. Increasing sharedbuffers a bit slightly reduces total cache size but improves sharedbuffer hit rate somewhat. Clearly, making sharedbuffers too large is bad, and making it too small is also bad. If sharedbuffers is too small, PG\u0026rsquo;s own working memory becomes too small, effectively offloading memory management to the OS, and OS pagecache reclamation will also affect performance. If sharedbuffers is too large, not only is pagecache squeezed, but PG\u0026rsquo;s dirty page flushing impact must also be considered, especially for write-heavy scenarios where corresponding bgwriter parameters need adjustment.\nFrom rough stress testing:\nWithout huge pages, shared buffers = min(1/4 MEM, 20GB) With huge pages, shared buffers = min(1/4 MEM, 60GB) Is the Difference Between Processes and Threads Really Not That Big? # Most Linux kernel material says that the difference between processes and threads is not significant. Whether creating a process or a thread, the kernel uses the same function, kernel_clone, to implement it. The only difference lies in the parameters passed. 
The fork and clone system calls are roughly the same[^ 《深入理解Linux进程和内存》 (Understanding Linux Processes and Memory)]:\nDimension Process Thread childID Each process has an independent pid (process ID) Each thread has a tid (thread ID), but the thread\u0026rsquo;s pid is the same as its process\u0026rsquo;s pid. Address Space Each process has an independent address space (mm_struct), including memory, stack, etc. Threads share the address space of their process; all threads\u0026rsquo; mm_struct points to the same address space. File System Each process has its own fs_struct, including file descriptors, mount points, etc. Threads share their process\u0026rsquo;s fs_struct; all threads\u0026rsquo; file descriptors and mount points are the same as the process. Compared to processes, threads are only slightly \u0026ldquo;lighter\u0026rdquo;. Overall, the similarities between processes and threads outweigh their differences.\nHowever, when the number of processes increases, the difference becomes significant, especially for multi-process applications like PostgreSQL:\nEach process has its own VMA, so more address spaces need to be maintained Each process has its own pagetable, so pagetables consume more memory Multiple processes increase TLB flush overhead, while threads do not Process switching requires more context switch overhead, while threads do not Inter-process communication (IPC) is less efficient, while threads can directly share memory without IPC communication issues You could say: processes and threads don\u0026rsquo;t differ much at creation time, but multi-process management and multi-thread management differ greatly.\nWhy Does the Standby Have PG-Level Dirty Pages? # The standby\u0026rsquo;s WAL replay mechanism itself generates dirty pages, and the standby also flushes dirty pages. You can view standby dirty pages through pg_buffercache. 
The standby\u0026rsquo;s dirty pages are different from the primary\u0026rsquo;s — standby dirty data is also just regular relations. You can also observe that the standby\u0026rsquo;s checkpoint/bgwriter/backend dirty flushing is different from the primary\u0026rsquo;s.\nWhy Is File Cache Higher on Some Databases and Lower on Others? # Generally, databases with high data dispersion have more file cache. Simple slow SQL queries are unlikely to maintain high file cache levels long-term. A slow SQL query accessing lots of data might briefly raise filecache, but after a while, these file pages\u0026rsquo; reference count drops, becoming inactive file pages, and memory can reclaim this portion. However, frequent data dispersion — such as when an index\u0026rsquo;s correlation approaches 0 (like a UUID primary key) — results in decent SQL performance but high reads, potentially generating frequent physical IO and loading too many pages into filecache. Even changes in business patterns can cause a large amount of shared buffer swapping in and out, significantly impacting performance.\nPG Processes and Shared Memory Mapping # #Without huge pages: /dev/zero (deleted) cat /proc/102208/smaps |egrep \u0026#34;rw\\-s\u0026#34; -A 1 2aefd8901000-2aefd8902000 rw-s 00000000 00:04 1202061313 /SYSV00001000 (deleted) Size: 4 kB -- 2aefd8918000-2aefd898f000 rw-s 00000000 00:13 4084862058 /dev/shm/PostgreSQL.1008001451 Size: 476 kB -- 2aefe2605000-2b00ad129000 rw-s 00000000 00:04 4084864418 /dev/zero (deleted) #With huge pages: /anon_hugepage (deleted) cat /proc/29091/smaps |egrep \u0026#34;rw\\-s\u0026#34; -A 1 2aaaaac00000-2ac3a2c00000 rw-s 00000000 00:0e 215471503 /anon_hugepage (deleted) Size: 104726528 kB -- 2b48dfe93000-2b48dfe94000 rw-s 00000000 00:04 88604727 /SYSV00001000 (deleted) Size: 4 kB -- 2b48dfeab000-2b48dff22000 rw-s 00000000 00:12 215515747 /dev/shm/PostgreSQL.1123685558 Size: 476 kB Child process page tables are all copied from the parent process; parent and child 
processes therefore share the same page frames[^ 《深入理解Linux内核》 (Understanding the Linux Kernel)]. So whether it\u0026rsquo;s the postmaster or backend processes (any process forked from postmaster), they all map the same shared memory address in their virtual memory — their addresses and Size in smaps are equal.\nWhy Do All PG Processes Have /dev/zero as the Largest Segment in Virtual Memory? # There are two main ways to implement anonymous page mapping with mmap: one is by setting the MAP_ANONYMOUS flag with fd=-1, and the other is by opening the /dev/zero device file and passing the resulting file descriptor to mmap. These two methods are functionally equivalent.\nPG shared buffers use the /dev/zero device mapping to implement anonymous shared pages, which is why you typically see PG processes having a large proportion of their virtual memory address space as /dev/zero.\nReferences # [Understanding the Linux Kernel]: Understanding the Linux Kernel: Memory Addressing, Memory Management, Address Space Management, Page Frame Reclamation\n[Understanding Linux Processes and Memory]: Understanding Linux Processes and Memory: CPU Hardware Principles, Process and Thread Comparison\n[Running Linux Kernel: Beginner\u0026rsquo;s Guide 2nd Edition]: Running Linux Kernel: Beginner\u0026rsquo;s Guide 2nd Edition: System Calls, Memory 
Management\nhttps://www.cs.oslomet.no/~haugerud/os/Forelesning/os7.pdf\nhttps://www.cs.unc.edu/~porter/courses/comp630/s24/slides/pfra.pdf\nhttps://www.kernel.org/doc/gorman/html/understand/index.html\nhttps://courses.cs.washington.edu/courses/cse333/20wi/lectures/07/CSE333-L07-posix_20wi.pdf\nhttps://www.sohu.com/a/392831824_467784\nRed Hat, Configuring an operating system to optimize memory access\nhttps://www.kernel.org/doc/html/latest/admin-guide/sysctl/vm.html#swappiness\nhttps://access.redhat.com/solutions/6785021\nhttps://www.kernel.org/doc/Documentation/vm/overcommit-accounting\nhttps://carlyleliu.github.io/LinuxKernel/LinuxMemoryOptimization/\nhttps://www.man7.org/linux/man-pages/man5/proc_pid_oom_score.5.html\nhttps://www.kernel.org/doc/html/latest/admin-guide/cgroup-v1/memory.html\nhttps://wiki.goframe.org/pages/viewpage.action?pageId=157646868\nhttps://www.man7.org/conf/lca2019/cgroups_v2-LCA2019-Kerrisk.pdf\nhttps://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html\nhttps://support.huaweicloud.com/usermanual-hce/hce_02_0072.html\nhttps://chrisdown.name/talks/cgroupv2/cgroupv2-fosdem.pdf\nhttps://www.cnblogs.com/muahao/p/10109712.html\n","date":"Jun 19, 
2025","externalUrl":null,"permalink":"/en/2025/06/19/linux-memory-advanced/","section":"Posts","summary":"(For memory basics, refer to Linux Memory Analysis; this article covers memory knowledge above that foundation)\nMemory Basic Concepts # buddy # The process of buddy system allocating and merging pages is omitted.\nEasily overlooked knowledge points:\nThe prerequisite for buddy merging two blocks of the same size is that their physical addresses are contiguous The merge algorithm is iterative: after merging at the current level, it will automatically attempt to merge larger blocks. This means compactd is not strictly required for merging page table \u0026 PTE # page table and PTE are actually two different concepts, and they are easily confused because both generally refer to page tables. Below is relevant knowledge about page table and PTE[^ 《深入理解Linux内核》 (Understanding the Linux Kernel)]\n","title":"Linux Memory Advanced","type":"posts"},{"content":" As a DBA # 2023 was a year of comprehensive PostgreSQL learning for me, and 2024 has been a year of comprehensive PostgreSQL operations. There\u0026rsquo;s actually a lot of material I really want to dive into but haven\u0026rsquo;t had the time. This year was mainly case analysis — I could only supplement my foundational knowledge here and there.\nMid-year there was a discussion about \u0026ldquo;will DBAs be eliminated in the cloud era.\u0026rdquo; This discussion left a deep impression on me. I thought about many things afterward — why do others seem to have so few things to deal with while I, as a DBA, have so much? I even went into cloud computing groups to debate about it, and I actually gained something from it. Different perspectives lead to unexpected conclusions. The conclusion of the debate may boil down to just one thing: DBAs are providing 1510 emotional value to their leaders.\nRight or wrong, you can see reflections on the DBA profession in many of my articles this year. 
Continuing down the traditional DBA path is certainly a dead end. Today\u0026rsquo;s DBAs lean more toward business data layer operations, or moving up to architecture design. Positions for expert DBAs focused purely on databases are actually very few.\nREADING # Let me reiterate why I\u0026rsquo;m so devoted to reading (I said this in 2023 too\u0026hellip;):\nThe value brought by reading is immeasurable in the short term Reading brings a pleasant sense of intellectual enrichment Learning is a belief. Yuval Harari has a view: believing in science is actually also a form of faith. I choose to believe in this faith, at least in 2024 and the foreseeable future. My book list roughly falls into three categories: PostgreSQL, broader technical scope, and extracurricular. Some are in Chinese, some in English. Some are physical books, some electronic.\nThis year, let me continue with a reading list ranking. Horizontal comparison across different categories is a bit of a stretch, so let\u0026rsquo;s compare within categories. Once again, note: these book lists are for books I aimed to \u0026ldquo;finish cover to cover.\u0026rdquo; Books used as references don\u0026rsquo;t count here.\n2024 PostgreSQL Book List (ranked by preference):\n\u0026ldquo;PostgreSQL Database Kernel Analysis\u0026rdquo; — clear thinking and framework, though the version is a bit old \u0026ldquo;Quickly Mastering PostgreSQL Version New Features\u0026rdquo; — this should be my favorite PostgreSQL book this year, because it has zero fluff throughout, a pleasure to read \u0026ldquo;The Internals of PostgreSQL\u0026rdquo; — I originally wanted to put this first, but since there\u0026rsquo;s a free online version at interdb, I wouldn\u0026rsquo;t even recommend buying this book. It\u0026rsquo;s ranked here because interdb is so excellent — its substitute is enshrined here as a deity \u0026ldquo;The Way of PostgreSQL: From Apprentice to Expert, 2nd Edition\u0026rdquo; — very detailed but also very long. 
I recommend skimming through quickly to find the key points without lingering too long \u0026ldquo;PostgreSQL Technical Internals: Transaction Processing Deep Dive\u0026rdquo; — transactions are the foundation of PostgreSQL, and also the foundation of my source code journey \u0026ldquo;PostgreSQL in Action\u0026rdquo; — the practical examples are well worth referencing \u0026ldquo;PostgreSQL 16 Administration Cookbook\u0026rdquo; — not recommended. The table of contents framework looks good, but the content is hollow. Don\u0026rsquo;t waste time on this book. 2024 Broader Technical Scope Book List (ranked by preference):\n\u0026ldquo;DDIA-v2: Designing Data-Intensive Applications (2nd Edition)\u0026rdquo; — so good I don\u0026rsquo;t know where to begin. So excellent that I specially wrote reading notes (my only book notes article this year). I wish I had found it sooner. \u0026ldquo;A Brief History of Databases\u0026rdquo; — reading history truly brings insight. The story of databases begins here. Some technical things become clearer in hindsight. \u0026ldquo;ITIL 4 and DevOps Service Management Certification Guide (2nd Edition)\u0026rdquo; — a classic in IT service management. It elevated my understanding of the operations role — how did these things so closely tied to my work come about? Which parts don\u0026rsquo;t match reality, and why weren\u0026rsquo;t they applied? You can grasp many things from it. \u0026ldquo;Cloud Native Kubernetes\u0026rdquo; — hardcore, another track entirely \u0026ldquo;Docker Deep Dive\u0026rdquo; — decent for understanding containers and container history. The container knowledge itself isn\u0026rsquo;t actually that much. \u0026ldquo;Brother Bird\u0026rsquo;s Linux Private Kitchen\u0026rdquo; — sorry, I genuinely hadn\u0026rsquo;t read this classic. Came to catch up. The writing approach is well worth learning from. The drawback is that much of it isn\u0026rsquo;t useful for my role. 
\u0026ldquo;Machine Learning\u0026rdquo; — ranked here not because the book is bad, but because it\u0026rsquo;s very hard to understand. I gave up about a quarter of the way through. This book showed me the upper limits of my intelligence, and I\u0026rsquo;m sad about it. \u0026ldquo;Building a Vector Database from Scratch\u0026rdquo; — if you want to read source code, go to GitHub \u0026ldquo;Deep Understanding of Go Language\u0026rdquo; — understood nothing at all 2024 Extracurricular Book List (ranked by preference):\n\u0026ldquo;Cancer Ward\u0026rdquo; — I finished this early in the first half of the year. While reading it, I felt: barring surprises, this book would rank first this year. Nobel Prize in Literature, well-deserved. \u0026ldquo;Intimate Relationships\u0026rdquo; — understanding relationships with lovers, friends, and bosses. Academic paper style, solid, I like it. \u0026ldquo;Does God Play Dice? A History of Quantum Physics\u0026rdquo; — setting aside everything else, the writing style provides immense emotional value, making me want to keep reading. I finished it in just a few days. \u0026ldquo;The Worlds I See\u0026rdquo; — AI pioneer Fei-Fei Li\u0026rsquo;s autobiography. The story of a girl who grew up in Chengdu venturing into the melting pot of America, eventually leading Google AI, while also narrating the history of AI development. \u0026ldquo;21 Lessons for the 21st Century\u0026rdquo; — the final installment of Yuval Harari\u0026rsquo;s trilogy. I loved the first two books, but this one felt just okay. At least it brought closure. \u0026ldquo;The Old Man and the Sea\u0026rdquo; — hard to evaluate. I like its temperament, but not its content. \u0026ldquo;The Wandering Earth\u0026rdquo; — this is a collection of Liu Cixin\u0026rsquo;s short stories. One day at the library, I bought it because of the first short story. After buying it, I found the other short stories to be very boring and childish. I felt cheated. 
\u0026ldquo;Journey to the West\u0026rdquo; — hot take: they can\u0026rsquo;t even explain Tang Sanzang\u0026rsquo;s background properly. A mess, completely confused. I gave up after a little bit. (My evaluation of \u0026ldquo;Romance of the Three Kingdoms\u0026rdquo; last year was very high.) Blog and WeChat Official Account # 2024 Published Articles:\nPostgreSQL technical: 21 Other technical: 2 Book notes: 1 Useless articles: 1 I only wrote 25 articles this year, a noticeable decrease from last year.\nWeChat Official Account followers: 600. Though not many, I believe every single one has good taste \u0026#x1f638;\nWriting technical articles is actually quite tiring — it takes far more time than one would imagine. However, you genuinely learn things during the writing process, and the sense of accomplishment from completing a piece is real. Since I feel responsible for my articles, I won\u0026rsquo;t write recklessly about things I don\u0026rsquo;t understand. As for errors arising from misunderstandings, that\u0026rsquo;s actually normal. No one can guarantee that their future self won\u0026rsquo;t criticize their current self — just write correctly for the current state.\nIn terms of writing content this year, I gave up writing reading notes for extracurricular books. I wrote quite a few last year, but writing reading notes takes a lot of time with very low value. Low emotional value tasks naturally get abandoned. In fact, my writing content varies each year. Currently, PostgreSQL database technical articles are the only constant — other types aren\u0026rsquo;t as stable. This is normal. The blog was originally meant for database writing. If there\u0026rsquo;s no application scenario for other domains, I won\u0026rsquo;t touch them again after the brief exploratory period.\nOne more complaint: domestic blogging platforms only care about article quantity, which is completely at odds with my writing style. 
Each of my articles is tens of thousands of hand-typed characters. I\u0026rsquo;m a quality-over-quantity blogger. So I can\u0026rsquo;t be bothered anymore — I\u0026rsquo;m planning to abandon CSDN in 2025 and just post on GitHub and my WeChat Official Account.\nI\u0026rsquo;ve been writing on CSDN since 2017. When I first started blogging, there weren\u0026rsquo;t many good blog hosting platforms. Looking at CSDN now: community interaction is zero, and the vast majority of articles on it are terrible. Even I don\u0026rsquo;t want to find CSDN articles myself. It\u0026rsquo;s like a first love of 7-8 years — sometimes you just have to break up.\n2024 Publication Channels:\nCSDN Blog: https://liuzhilong.blog.csdn.net Modb.pro: liuzhilong62 GitHub: https://github.com/liuzhilong62/blogs WeChat Official Account: 破斯特贵斯库儿 Expected 2025 Channels:\nGitHub: https://github.com/liuzhilong62/blogs WeChat Official Account: 破斯特贵斯库儿 Other platforms: we\u0026rsquo;ll see Final Thoughts # I seem to talk about work-learning balance every year\u0026hellip; Due to a dramatic increase in workload this year, there was even a period where I couldn\u0026rsquo;t study at all. Balance has been shattered. Not having time to study is unacceptable to me, so I later adjusted my daily schedule (thanks to \u0026ldquo;Atomic Habits\u0026rdquo; — I absolutely love this book), and finally managed to squeeze out some study time. 
Actually, as long as no one\u0026rsquo;s around, learning efficiency is high.\nI\u0026rsquo;ve collected some quotes I resonated with this year:\nDon\u0026rsquo;t let others become dependencies in your task chain \u0026ndash;heisenberg.liu Plans that require execution are generally simple plans \u0026ndash;heisenberg.liu Things not implemented equal things not done \u0026ndash;somebody Solve problems yourself instead of waiting for others to reply \u0026ndash;somebody Important things should be done immediately — waiting even a moment means they won\u0026rsquo;t get done \u0026ndash;somebody Don\u0026rsquo;t do repetitive low-value tasks. Think more about the context behind this requirement \u0026ndash;heisenberg.liu Don\u0026rsquo;t pan for gold in shit. Find ways to get quality information sources \u0026ndash;somebody SREs need the ability to configure optimal default parameters and the ability to modify these parameters in bulk \u0026ndash;\u0026ldquo;Enterprise Cloud Computing\u0026rdquo; The more miscellaneous tasks you do, the more miscellaneous tasks come your way \u0026ndash;heisenberg.liu SREs spend 50% of time on operations and 50% on development \u0026ndash;\u0026ldquo;Enterprise Cloud Computing\u0026rdquo; Premature optimization is the root of all evil. Premature code abstraction is also the root of all evil \u0026ndash;somebody The speed at which the human brain receives knowledge is limited \u0026ndash;somebody If someone won\u0026rsquo;t let you read, leave that person or leave that environment \u0026ndash;heisenberg.liu Teams that build knowledge bases are slackers \u0026ndash;somebody The value of a standard is determined by the customer \u0026ndash;\u0026ldquo;ITIL 4\u0026rdquo; Heroism: working long hours and troubleshooting alone. Long working hours also lead to burnout with the work itself. 
Those who want to be heroes are only interested in their own achievements and turn a deaf ear to team collaboration \u0026ndash;\u0026ldquo;ITIL 4\u0026rdquo; Not all problems need root cause analysis. It depends on the frequency of occurrence and the scope of the failure \u0026ndash;\u0026ldquo;ITIL 4\u0026rdquo; Looking back at the plans I set for myself in 2023: only 2 items total, and I completed neither. KPI achievement rate: 0% \u0026#x1f604;\nCombining agile operations, agile project management, and OKR thinking: setting a full-year plan for myself at the beginning of the year is simply unreasonable. Looking back at last year and the year before, some of my plans emerged mid-way and won priority battles over other tasks. And some tasks simply couldn\u0026rsquo;t be completed — this should be a normal state. So, I won\u0026rsquo;t set too many flags for myself.\n2025 Plan:\nContinue some things Think about how to produce output Master another track PostgreSQL\u0026hellip; haven\u0026rsquo;t figured out what more to do Find a way to resume fitness ","date":"Jan 11, 2025","externalUrl":null,"permalink":"/en/2025/01/11/my-2024-year-end-summary/","section":"Posts","summary":"As a DBA # 2023 was a year of comprehensive PostgreSQL learning for me, and 2024 has been a year of comprehensive PostgreSQL operations. There’s actually a lot of material I really want to dive into but haven’t had the time. This year was mainly case analysis — I could only supplement my foundational knowledge here and there.\nMid-year there was a discussion about “will DBAs be eliminated in the cloud era.” This discussion left a deep impression on me. I thought about many things afterward — why do others seem to have so few things to deal with while I, as a DBA, have so much? I even went into cloud computing groups to debate about it, and I actually gained something from it. Different perspectives lead to unexpected conclusions. 
The conclusion of the debate may boil down to just one thing: DBAs are providing 1510 emotional value to their leaders.\n","title":"My 2024 Year-End Summary","type":"posts"},{"content":"This article focuses on common PostgreSQL operations issues — rare edge cases that surface once every two or three years are out of scope.\nIt\u0026rsquo;s primarily a technical ops summary, aiming for clarity and quick applicability. Deep dives at the source-code level are deliberately avoided.\nSQL Performance \u0026amp; Execution Plans # Sudden Execution Plan Changes # PostgreSQL does not support optimizer hints natively, and the community has made it clear it never will. The PG community\u0026rsquo;s stance is roughly: \u0026ldquo;Our optimizer is perfect. If the current plan isn\u0026rsquo;t good enough, it\u0026rsquo;s because the developer doesn\u0026rsquo;t understand optimization.\u0026rdquo;\nRegardless of what the PG community thinks, sudden execution plan regressions happen all the time in production, and we don\u0026rsquo;t have the rich, native plan-binding mechanisms that Oracle provides. This is a real challenge for production operations. For example: one morning, a sensitive query suddenly changes its plan, runtime jumps from 0.1s to 1s, and due to some concurrency the database CPU gets hammered — the business notices immediately. Without plan-binding tools, our only two rapid recovery options are: 1) collect statistics, or 2) scale up CPU.\nA question about rapid recovery: does collecting statistics always help? A good DBA can identify where the optimizer went wrong, but can\u0026rsquo;t instantly conjure up a complete correct plan — especially for complex queries. Collecting statistics essentially hands the optimization problem back to the optimizer, trusting it to get it right. While this sounds a bit shaky, in PostgreSQL it actually works most of the time. 
(For scenarios where collecting stats is known to be useless, see the \u0026ldquo;ORDER BY LIMIT Problem\u0026rdquo; section.)\nWhy do execution plans suddenly change and regress?\nPlans are cost-based, costs rely on statistics, and statistics are always lagging Sufficiently complex SQL has a huge number of possible execution paths, and the optimizer picks the lowest-cost one PG exposes many optimizer parameters to tune for local hardware (e.g., seq_page_cost, effective_cache_size). These can nudge the optimizer\u0026rsquo;s preferences but are very low-level. While there\u0026rsquo;s theoretical tuning headroom, changing them has system-wide effects. After go-live, adjusting these is extremely high-risk. The very existence of these parameters hints that no plan can be 100% perfect, because the optimizer\u0026rsquo;s reasoning depends on its environment Even mighty Oracle, with its arsenal of plan-stabilization features, can\u0026rsquo;t guarantee 100% problem-free SQL — because SQL, data, statistics, bind variables, etc. are all dynamic.\nFor PG users, we\u0026rsquo;re not there yet, but we can work on making plans more stable:\nDon\u0026rsquo;t join too many tables. More tables mean more possible plans — to the point where PG GEQO stops enumerating all plans, reducing the chance of finding the optimal one Don\u0026rsquo;t write overly complex SQL. Keep in mind SQL may come from ORM frameworks rather than hand-written queries. Framework-generated SQL is often optimized for a goal with little regard for brevity or readability, making it very hard to tune Don\u0026rsquo;t create indexes indiscriminately — have a clear goal. 
Random indexes confuse the optimizer Tune per-table statistics collection thresholds via autovacuum_analyze_scale_factor (see \u0026ldquo;Delayed Statistics Collection\u0026rdquo;) Use pg_hint_plan to give the optimizer hints pg_hint_plan # pg_hint_plan is a third-party extension that uses hints to guide the optimizer toward the correct plan.\nWhat pg_hint_plan supports:\nSpecifying scan methods (e.g., index scan), join methods (NL/HASH/MERGE), join order, memoize, estimated row counts, parallelism, and GUC parameters Binding hints to SQL via hint_plan.hints without modifying the application SQL text pg_hint_plan limitations:\nUsage restrictions with subqueries, foreign tables, CTEs, views, PL/pgSQL, etc. compute_query_id treats hints as comments and ignores them Unknown bugs While this extension is actively maintained, I haven\u0026rsquo;t found large-scale production deployment cases yet. We\u0026rsquo;ve also encountered issues in limited production use where hints don\u0026rsquo;t take effect — possibly related to JDBC plan caching — but it\u0026rsquo;s hard to draw firm conclusions.\nIn short: pg_hint_plan is a good tool, but large-scale production deployment is still TBD. I recommend waiting and watching. You can trial it, but don\u0026rsquo;t become dependent on it.\nDelayed Statistics Collection # Statistics are the foundation of SQL optimization. 
PG statistics aren\u0026rsquo;t particularly complex, but many people still don\u0026rsquo;t fully understand them.\nThe three key views for PG statistics: pg_class, pg_stat_all_tables, pg_stats\n-- pg_class: pages and tuples select relname,relpages,reltuples::bigint from pg_class where relname=\u0026#39;lzlpg\u0026#39;\\gx -[ RECORD 1 ]------ relname | lzlpg relpages | 187501 reltuples | 6000032 -- pg_stat_all_tables: live tuples, dead tuples, last analyze time select relname,n_live_tup,n_dead_tup,last_analyze,last_autoanalyze from pg_stat_all_tables where relname=\u0026#39;lzlpg\u0026#39;\\gx -[ RECORD 1 ]----+------------------------------ relname | lzlpg n_live_tup | 6000032 n_dead_tup | 0 last_analyze | 2025-01-04 15:54:44.553057+08 last_autoanalyze | [null] -- pg_stats: per-column statistics — understand every field select * from pg_stats where tablename=\u0026#39;lzlpg\u0026#39; and attname=\u0026#39;a\u0026#39;\\gx -[ RECORD 1 ]----------+------- schemaname | public tablename | lzlpg attname | a inherited | f null_frac | 0 avg_width | 70 n_distinct | -1 most_common_vals | [null] most_common_freqs | [null] histogram_bounds | [null] correlation | [null] most_common_elems | [null] most_common_elem_freqs | [null] elem_count_histogram | [null] Stale statistics are very likely to cause execution plan changes and SQL performance issues. Check last_analyze and last_autoanalyze in pg_stat_all_tables to determine if collection is lagging.\nWhy tune the collection threshold? Because the default autovacuum_analyze_scale_factor is 0.1, meaning autoanalyze only fires once the changed rows exceed roughly 10% of the table (plus autovacuum_analyze_threshold, 50 rows by default). For a 1-billion-row table, that\u0026rsquo;s 100 million changed rows, which is possibly far too infrequent.\nEvaluate whether to tune per-table autovacuum_vacuum_scale_factor and autovacuum_analyze_scale_factor based on: whether it\u0026rsquo;s a core business table, number of joins, query complexity, access frequency, month-boundary issues, data skew, etc. 
The goal: increase collection frequency to reduce plan-regression risk without wasting resources on excessive vacuuming.\nWhat value should you set? An example:\nFor a monthly table (or monthly partition) with queries hitting the current day\u0026rsquo;s data: with autovacuum_analyze_scale_factor = 0.1, the table gets analyzed almost daily for the first ~10 days, but may skip analysis around day 12. At that point statistics can cross a boundary and plans may degrade. To ensure analysis continues through days 10–31 of the month, set autovacuum_analyze_scale_factor below 0.03. I recommend autovacuum_analyze_scale_factor = 0.02.\nParameter tuning reference (consider your table\u0026rsquo;s data model!):\nParameter Default Recommended autovacuum_vacuum_scale_factor 0.2 0.04 autovacuum_analyze_scale_factor 0.1 0.02 The Optimizer May Choose a Non-Primary-Key Index # Intuitively, a primary key should have the best selectivity, but the optimizer may still choose something else.\n-- Reproduction commands create table t1(a char(1000) primary key,b char(1000)); insert into t1 select md5(g::text),md5(g::text) from generate_series(1,10000) g; create index idxa on t1(a); create index idxb on t1(b); analyze t1; explain (analyze,buffers) select * from t1 where a=\u0026#39;qwer\u0026#39; and b=\u0026#39;qwer\u0026#39;; explain (analyze,buffers) select * from t1 where a=\u0026#39;qwer\u0026#39; and b||\u0026#39;\u0026#39;=\u0026#39;qwer\u0026#39;; -- Columns a and b have identical selectivity, but the optimizer picks the regular index, not the PK explain (analyze,buffers) select * from t1 where a=\u0026#39;qwer\u0026#39; and b=\u0026#39;qwer\u0026#39;; QUERY PLAN ------------------------------------------------------------------------------------------------------------ Index Scan using idxb on t1 (cost=0.41..5.43 rows=1 width=2008) (actual time=0.045..0.046 rows=0 loops=1) Index Cond: (b = \u0026#39;qwer\u0026#39;::bpchar) Filter: (a = \u0026#39;qwer\u0026#39;::bpchar) Buffers: 
shared hit=3 -- Force the PK path — cost is only marginally higher explain (analyze,buffers) select * from t1 where a=\u0026#39;qwer\u0026#39; and b||\u0026#39;\u0026#39;=\u0026#39;qwer\u0026#39;; QUERY PLAN ------------------------------------------------------------------------------------------------------------ Index Scan using idxa on t1 (cost=0.41..5.44 rows=1 width=2008) (actual time=0.079..0.079 rows=0 loops=1) Index Cond: (a = \u0026#39;qwer\u0026#39;::bpchar) Filter: (((b)::text || \u0026#39;\u0026#39;::text) = \u0026#39;qwer\u0026#39;::text) Buffers: shared read=3 Even though columns a and b have the same type and selectivity, the optimizer picks the regular index over the PK. The PK path costs 0.01 more.\nWhy does this matter?\nWith the current data distribution, picking the regular index is harmless. But once data changes, the two index plans can diverge dramatically:\nalter table t1 set (autovacuum_enabled =\u0026#39;off\u0026#39;); insert into t1 select md5(g::text),\u0026#39;repeat\u0026#39; from generate_series(20001,30000) g; -- b=\u0026#39;repeat\u0026#39; has terrible selectivity, but the b index is still chosen explain (analyze,buffers) select * from t1 where a=\u0026#39;qwer\u0026#39; and b=\u0026#39;repeat\u0026#39;; QUERY PLAN -------------------------------------------------------------------------------------------------------------- Index Scan using idxb on t1 (cost=0.41..5.43 rows=1 width=2008) (actual time=15.823..15.824 rows=0 loops=1) Index Cond: (b = \u0026#39;repeat\u0026#39;::bpchar) Filter: (a = \u0026#39;qwer\u0026#39;::bpchar) Rows Removed by Filter: 10000 Buffers: shared hit=2511 -- Compare with the PK plan explain (analyze,buffers) select * from t1 where a=\u0026#39;qwer\u0026#39; and b||\u0026#39;\u0026#39;=\u0026#39;repeat\u0026#39;; QUERY PLAN ------------------------------------------------------------------------------------------------------------ Index Scan using idxa on t1 (cost=0.41..5.44 rows=1 width=2008) (actual 
time=0.041..0.041 rows=0 loops=1) Index Cond: (a = \u0026#39;qwer\u0026#39;::bpchar) Filter: (((b)::text || \u0026#39;\u0026#39;::text) = \u0026#39;repeat\u0026#39;::text) Buffers: shared hit=3 Even with poor real selectivity, the optimizer sticks with the regular index — but efficiency is far worse (shared hit=2511 vs. shared hit=3). For latency-sensitive queries or larger data volumes, this becomes a real production problem.\nSolutions:\nManually collect statistics; increase collection frequency Use pg_hint_plan Rewrite the SQL to prevent it from using the regular index The ORDER BY LIMIT Problem # ORDER BY with LIMIT is a well-known issue with plenty of write-ups and case studies online (see my post ORDER BY LIMIT 10 Is Slower Than ORDER BY LIMIT 100).\nThe root cause: the optimizer currently can\u0026rsquo;t estimate where data sits in the table relative to the index order. If matching rows happen to be near the end of the table, the scan reads far more data than expected before returning the LIMIT rows. Note this isn\u0026rsquo;t limited to ORDER BY + LIMIT — any operation involving sorted output + LIMIT can hit it: GROUP BY + LIMIT, DISTINCT + LIMIT, merge joins, etc.\nSolutions:\nRewrite the SQL: add an expression to prevent using the sort-column index (including PK), e.g., order by ''||col1 limit xxx Create a composite index: a composite index on (sort_column + index_column) may be chosen by the optimizer and is generally more efficient than an index on the sort column alone. This approach doesn\u0026rsquo;t require changing the SQL Table Bloat # Something Blocking Dead Tuple Cleanup # Putting aside autovacuum configuration issues and edge cases, the common blockers are:\nLong-running transactions. Note: a long transaction on a different table also blocks dead-tuple reclamation. Read-only queries cause this too. Replication slots. Lagging or defunct replication slots cause this. 
Both are relatively easy to solve: 1) terminate the long-transaction session, 2) drop the replication slot, or have the consumer analyze why consumption is so slow.\nHigh-Concurrency UPDATE Causing Table Bloat # Unlike something blocking vacuum, this is about dead tuples being generated faster than vacuum can clean them up. Typically, such tables show high pg_stat_all_tables.n_tup_upd. If table bloat requires repack, assess whether write volume is high enough to make repeated manual repack a losing game. In that case, tune the table/index fillfactor.\nFor the underlying principles, see this post From Painfully Slow Unique Index Scans to Index Bloat. I\u0026rsquo;ll summarize the conclusions here:\nfillfactor basics:\nfillfactor acts as a high-water mark for tables or indexes. During INSERT, once a page reaches its fillfactor line, new rows go to the next page. The purpose is to reserve space for UPDATEs so they don\u0026rsquo;t constantly seek new pages.\nWhile both tables and indexes have fillfactor with the same goal (accommodating UPDATEs), the details differ significantly:\nTables: If a page still has free space, an UPDATE can stay within the same page — no new page needed, no need to find another page with space. More importantly, thanks to PG\u0026rsquo;s HOT (Heap-Only Tuple) feature, in-page updates don\u0026rsquo;t touch indexes, naturally slowing index bloat Indexes: Different rows or out-of-page updates of the same row generate new index entries. Reserving space in index pages via fillfactor greatly reduces index page splits Of course, fillfactor settings are tightly coupled with the workload. If data is append-only like logs with zero updates, fillfactor=100 for both tables and indexes is perfectly fine. But most business tables see updates, so fillfactor shouldn\u0026rsquo;t be 100. 
With frequent UPDATEs, it should be even lower.\nYet PG\u0026rsquo;s defaults are:\nTable default: fillfactor=100 Index default: fillfactor=90 Recommended settings:\nalter table lzlpg set (fillfactor=60); alter index lzlpg_pkey set (fillfactor=70); -- These commands only affect new pages; existing pages need repack -- Repack: 1. Check for long transactions; resolve them first 2. nohup pg_repack -d lzldb --table lzlpg -p 6666 --no-kill-backend \u0026gt; pgrepack_lzlpg_log.log 2\u0026gt;\u0026amp;1 \u0026amp; Long Transaction Problems # Long transactions don\u0026rsquo;t have a huge amount of theory behind them — monitor and handle promptly — but they absolutely deserve their own section.\nLong transactions cause many problems:\nUnreleased locks → application blocking WAL not recycled → disk alerts Dead tuples not cleaned → SQL performance degradation Various other bizarre performance issues linked to long transactions \u0026hellip; Long transactions in PostgreSQL are far more damaging than in Oracle or MySQL. They must be strictly managed.\nSubtransaction Problems # \u0026ldquo;Subtransactions are basically cursed. 
Rip em out.\u0026rdquo;\nSubtransactions cause many problems and are a frequent pain point in the industry.\nIndustry experience reports:\nWaiting for Postgres 17: Configurable SLRU cache sizes for increased performance\nSubtransactions-overflow-and-the-performance-cliff\nWhy we spent the last month eliminating PostgreSQL subtransactions\nWhere subtransactions come from:\nPL/pgSQL functions containing a block with an exception clause savepoints JDBC + autosave=always (default autosave=never) ODBC Note: OGG uses an ODBC driver, and ODBC cannot disable subtransactions.\nGaussDB\u0026rsquo;s ODBC can disable subtransactions via ForExtensionConnector.\nSo we can advise applications to keep subtransactions under 64, but we can\u0026rsquo;t easily advise against using OGG, since migrating off Oracle often depends on OGG-based data sync tools.\nSubtransaction problem scenarios and symptoms:\n1(+) long transaction + subtransaction overflow + high concurrency → severe performance drop Subtransaction overflow (64+) → noticeable performance dip Subtransaction overflow (64+) + multixact → severe performance drop 1(+) long transaction + 1(+) subtransaction → severe query performance drop on read replicas Improvements in PG17:\nSLRU manages transaction relationships for clog, multixact, subtrans, etc. in shared memory. Relevant source definitions:\n/* Number of SLRU buffers to use for subtrans */ #define NUM_SUBTRANS_BUFFERS\t32 // 32 SLRU pages in shared memory /* * Each backend advertises up to PGPROC_MAX_CACHED_SUBXIDS TransactionIds * for non-aborted subtransactions of its current top transaction. These * have to be treated as running XIDs by other backends. * * We also keep track of whether the cache overflowed (ie, the transaction has * generated at least one subtransaction that didn\u0026#39;t fit in the cache). * If none of the caches have overflowed, we can assume that an XID that\u0026#39;s not * listed anywhere in the PGPROC array is not a running transaction. 
Else we * have to look at pg_subtrans. */ #define PGPROC_MAX_CACHED_SUBXIDS 64\t// Overflow at 64+, per backend PG17 SLRU improvements: New GUC parameter to configure SLRU slot count; split the existing single centralized SLRU lock into multiple bank locks.\nImprovement effect:\n(https://www.pgevents.ca/events/pgconfdev2024/sessions/session/53/slides/27/SLRU%20Performance%20Issues.pdf)\nSubtransaction handling strategies:\nDev standards: Don\u0026rsquo;t use savepoints; consider ON CONFLICT for write conflicts Dev standards: Don\u0026rsquo;t use exception blocks Dev standards: Ensure JDBC does not have autosave=always enabled Monitoring: Targeted monitoring of pg_stat_slru Monitoring: Targeted monitoring of SAVEPOINT and EXCEPTION CDC standards: Use ODBC (and OGG or other ODBC-based tools) with care; split transactions, cap subtransactions per large transaction at 50K Upgrade: Move to PG17 Concurrency \u0026amp; Performance # Snapshot and Concurrency Parameter Tuning # Parameter Type Default Recommended Requires Restart old_snapshot_threshold cpu -1 (community) -1 Yes max_parallel_workers_per_gather cpu 2 0 No old_snapshot_threshold easily causes performance problems when enabled — there\u0026rsquo;s plenty of material online. Even though it requires a restart, I strongly recommend keeping it disabled.\nmax_parallel_workers_per_gather auto-enables parallelism for large queries, but parallelism of 2 won\u0026rsquo;t give a proportional 2x speedup. This parameter is best used in specific scenarios, like explicitly setting parallel workers for batch jobs. Since no restart is needed, it\u0026rsquo;s a quick change.\nWill disabling old_snapshot_threshold cause problems?\nNo. This parameter exists to limit long transactions — which do damage performance in PG — but the parameter itself causes performance issues, defeating the purpose.\nLong transactions can be handled via several mechanisms:\nLong transaction monitoring. 
This is the most important, and monitoring is fairly mature. Set statement_timeout (default 0) Set transaction_timeout (default 0, available in PG17+) Set lock_timeout (default 0; recommended at session level for DDL) Set idle_in_transaction_session_timeout (default 0; we set it to 2h) Set idle_session_timeout (default 0; not relevant here) High-Concurrency Commits Causing LWLOCK:WALWrite # Case Study: Intermittent Slow INSERT \u0026hellip; VALUES\nKey takeaways:\nThere\u0026rsquo;s only one IO:WALWrite, but there can be dozens of LWLOCK:WALWrite waiters You can\u0026rsquo;t directly see the LWLOCK blocking chain, but from the source code we know LWLOCK:WALWrite is waiting on IO:WALWrite In high-concurrency small-transaction scenarios, increasing WAL buffer size theoretically doesn\u0026rsquo;t help much What problems does this cause?\nConcurrent writes block, write latency increases, active sessions may spike High-concurrency small transactions can\u0026rsquo;t saturate disk IO Solutions:\nDistribute concurrent writes across time Batch commits at the application level Analyze and try to reduce FPI (see FPI section) Group commit (TBD) WAL \u0026amp; Latency # FPI and Checkpoint Parameters # PG generates WAL FPI (Full Page Images) the first time a page is touched after a checkpoint. So more frequent checkpoints → higher probability of FPI.\nCheckpoint frequency is controlled by two parameters:\ncheckpoint_timeout max_wal_size Principle:\n(Egor Rogov, PostgreSQL 14 Internals)\nmax_wal_size defaults to 1GB, which is too small for high-load databases. Generally, you should increase this parameter to reduce FPI.\ncheckpoint_timeout defaults to 5 minutes, which seems reasonable.\nFPI and Random Writes # Even with longer checkpoint intervals, FPI problems may persist. Check whether the workload involves UUID-based random writes. 
You may need to switch to sequences or another UUID scheme.\nFinding the specific index:\nCheck if FPI is severe --stats=record is handy\npg_waldump -z --stats=record 00000001000001860000001B Sort which relations have the most FPWs pg_waldump 00000001000001860000001B|grep FPW|awk -F \u0026#39;:\u0026#39; \u0026#39;{print $7}\u0026#39;|awk \u0026#39;{print $2}\u0026#39;|sort -n|uniq -c |sort -r|head -10 Logical Replication \u0026amp; Replication Slots # Logical replication has many issues and is a key optimization area for the community; nearly every major version brings significant improvements.\nLogical Replication and Replication Slots Basics\nSpill Problem # Analysis of PG Startup Logic and Spill-Induced Slow Startup\nSpill key takeaways:\nSpill occurs when logical decoding can\u0026rsquo;t fit transaction data in memory, so it writes to disk. Spill files contain transaction information Each walsender has independent decoding, so each logical replication subscriber has its own spill Large transactions produce large spill files, typically few in number Subtransaction spill produces one spill file per subtransaction Versions:\nPG12 and earlier: spill threshold hard-coded at 4096 changes PG13 added logical_decoding_work_mem to adjust memory and reduce spill probability PG14+ supports streaming of in-progress transactions to the subscriber (not to be confused with physical streaming replication) Streaming also requires certain conditions to trigger, so even with streaming, spilling can still occur PG17 added debug_logical_replication_streaming to force streaming WALSender Blocking Shutdown # PG Shutdown Logic and WALSender Blocking Shutdown Analysis\nIn reality, any process that doesn\u0026rsquo;t exit can block shutdown. The question is which ones are most likely to cause trouble. 
From the shutdown code flow, archiver and walsender are frequent blockers because during shutdown they attempt a final archive or log transmission.\nIf shutdown is stuck on walsender, try kill (not kill -9) — the checkpoint hasn\u0026rsquo;t finished yet, and a forced shutdown leaves an inconsistent state. Even for forced shutdown, prefer pg_ctl stop -D $PGDATA -m i over raw kill -9 If shutdown is stuck on archiver, kill -9 is fine — the checkpoint is already complete and the database is in a consistent state Partitioned Tables # Partitioned Table Basics\nPG\u0026rsquo;s partitioned tables have unique characteristics that developers generally don\u0026rsquo;t fully understand without study, leading to many pitfalls.\nIndex Mismatch Between Parent and Child Partitions # Due to non-standard partition creation, many indexes are created directly on child tables (which should not be done), and the \u0026ldquo;create index on all children + attach\u0026rdquo; workflow is skipped. The result: the parent table has no index or no effective index. Since the parent has no data, this doesn\u0026rsquo;t directly impact queries — but when new partitions are created, they only inherit the parent\u0026rsquo;s indexes, so new child tables end up missing indexes.\nFixing parent-table missing indexes is fairly straightforward: see The Correct Way to Create Partition Indexes\n-- Create an invalid index ONLY on the parent. Fast, but blocks subsequent DML — watch for long transactions CREATE INDEX IDX_DATECREATED ON ONLY lzlpartition1(date_created); -- Create the index CONCURRENTLY on each child partition. Slow, but doesn\u0026#39;t block DML — watch for long DML transactions that could cause the operation to fail create index concurrently idx_datecreated_202302 on lzlpartition1_202302(date_created); -- Attach all indexes. 
Fast, no business blocking ALTER INDEX idx_datecreated ATTACH PARTITION idx_datecreated_202302; Fixing a missing primary key on the parent is harder: see Adding Primary Keys and Unique Indexes to Partitioned Tables\nAdding a primary key on the parent acquires AccessExclusiveLock, blocking everything. Creating an index on a partitioned table is slow, and the PK then causes further blocking. There\u0026rsquo;s currently no low-impact way to add a PK on a partitioned table. Workarounds: \u0026ldquo;attach a unique index + NOT NULL constraint\u0026rdquo;, schedule extended downtime for the partition table while the index builds, or use a third-party sync tool to populate a new table that already has the PK.\nAbusing the DEFAULT Partition # Default Partition Overgrowth Causing Prolonged Blocking During CREATE TABLE ... PARTITION OF\nThe root cause is simple: when adding a new partition, the DDL must validate that data in the DEFAULT partition doesn\u0026rsquo;t conflict with the new partition\u0026rsquo;s range. This scans a large amount of data in the DEFAULT partition, and the new partition creation never completes. Blocking then cascades — business queries and writes stall.\nDEFAULT partition abuse is a widespread problem! The community PG doesn\u0026rsquo;t provide interval partitioning. If a developer forgets to create a partition, data silently lands in DEFAULT with no error or alert. Day after day, the DEFAULT partition grows enormous — and then the next schema change causes an outage.\nYou can\u0026rsquo;t leave an oversized DEFAULT partition as-is forever. 
Even though ATTACH can avoid the blocking problem, you still need to defuse this bomb eventually.\nDEFAULT partition data handling — Plan 1:\nDETACH the default partition, create proper partitions, then re-insert DEFAULT data into the partitioned table If needed, after detach and creating proper partitions, create an empty DEFAULT partition to maintain business continuity Note: DETACH (unlike ATTACH) requires an AccessExclusiveLock on the parent. PG14 supports DETACH CONCURRENTLY, but not for DEFAULT partitions DEFAULT partition data handling — Plan 2:\nDETACH the default partition, create proper partitions, then ATTACH the detached DEFAULT table as a regular child partition — careful with range boundaries If needed, after detach and creating proper partitions, create an empty DEFAULT partition to maintain business continuity Note: DETACH (unlike ATTACH) requires an AccessExclusiveLock on the parent. PG14 supports DETACH CONCURRENTLY, but not for DEFAULT partitions DEFAULT partition data handling — Plan 3:\nCreate a new table, sync all data via DTS Rename tables Plan 3 looks the crudest, but it\u0026rsquo;s the one I personally recommend most. If you have 5 instances to fix, a surgical approach is fine. If you have 200 instances, the labor cost makes DTS the practical winner.\nMissing SELECT Privileges on Partitions Causing Abnormal Plans # If a user lacks SELECT privilege on a child partition, their queries can\u0026rsquo;t access that partition\u0026rsquo;s statistics, leading to bad execution plans. Partitions created via CREATE TABLE ... 
PARTITION OF normally don\u0026rsquo;t carry SELECT grants — but data is accessible through the parent — so this is a widespread issue.\nSolutions:\nHave the cloud platform handle it automatically Enforce dev standards requiring SELECT grants on child partitions High-Concurrency Full Partition Scans and LWLock:lockmanager # This is another very common problem!\nI recommend reading the AWS documentation, which explains it clearly: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/wait-event.lw-lock-manager.html\nSymptoms:\nSpiking active sessions Severe LWLock:lockmanager wait events Database performance cliff Trigger conditions:\nQuery scans multiple partitions That query has high concurrency Key takeaways:\nThe fastpath lock mechanism is designed for quick access to \u0026ldquo;weak locks\u0026rdquo;, improving database concurrency fastpath works for lock levels ≤ 3 — i.e., SELECT, SELECT FOR xxx, and DML (lock modes below ShareUpdateExclusiveLock — levels 1, 2, 3 can use fastpath). 
It\u0026rsquo;s meant to benefit normal operations FP_LOCK_SLOTS_PER_BACKEND: a local process holds at most 16 fastpath locks; beyond that, it must acquire locks in shared memory, producing LWLock:lockmanager Not just tables — every accessed index also acquires a lock This problem isn\u0026rsquo;t tightly coupled to partition count — even a modest number of partitions can trigger LWLock:lockmanager and degrade performance Let\u0026rsquo;s calculate: with a partitioned table having 1 primary key and 2 regular indexes, how many partitions exhaust the fastpath?\n16 / (3 indexes + 1 table) - 1 parent = 3 child partitions Yes — a full scan across just 3 partitions can already trigger LWLock:lockmanager waits.\nFor a regular table, 16 indexes would similarly exhaust fastpath.\nSolutions:\nFor not-too-large tables, merge partitions into a regular table Add partition key filter conditions to queries Reduce indexes (not very practical, since partition count alone can exceed 16) The hard part:\nIn Oracle-to-PG migrations, Oracle supports global indexes, so primary keys and unique indexes don\u0026rsquo;t need to include the partition key. In PG, they must include the partition key.\nPK example:\nidxlzl(primarykey) --oracle idxlzl(primarykey,partitionkey) --pg A common query pattern:\nselect col from tlzl where primarykey=12345; Should you push the application to add a partition filter here? It\u0026rsquo;s a tough sell. The resistance is: \u0026ldquo;I already passed the primary key — what more do you want? If I knew everything, why would I query the database?\u0026rdquo;\nIn this case, the only recommendation is to convert the partitioned table to a regular table. I haven\u0026rsquo;t found a better solution.\nMemory # Excessive Objects Leading to Oversized relcache # Key takeaways:\nrelcache stores relation metadata: OID, pg_class info, partitions, subtransactions, row-level security policies, statistics, index metadata, access methods, etc. 
Each session has its own (rel)cache for system catalog data (metadata, etc.) Normally this cache is small. When the catalog is huge and a session accesses all of it, the cache can become very large Cache management is simple: no eviction mechanism, no limit (though there are invalidation messages) Closing the session releases the cache Solutions:\nReduce the number of objects — especially check whether partition child tables are excessive Set aggressive connection-pool disconnection parameters so business connections recycle more frequently Memory Fragmentation # Recommended commands:\ncat /proc/meminfo|grep whatyouneed cat /proc/buddyinfo ## cgroup memory /opt/cgtools/cginfo -t perf -s mem # Pay attention to pgscand/s (direct memory reclaim) — values in the tens of thousands indicate a problem sar -B -s \u0026#34;08:00:00\u0026#34; -e \u0026#34;09:00:00\u0026#34; # min_free_kbytes setting: cat /proc/sys/vm/min_free_kbytes # Total physical memory usage of all processes: grep Pss /proc/[1-9]*/smaps | awk \u0026#39;{total+=$2}; END {printf \u0026#34;%d kB\\n\u0026#34;, total }\u0026#39; # PSS memory for a specific process: cat /proc/90875/smaps |grep Pss |awk \u0026#39;{sum+=$2 };END {print sum/1024}\u0026#39; # RSS memory for a specific process: cat /proc/68729/smaps |grep Rss |awk \u0026#39;{sum+=$2 };END {print sum/1024}\u0026#39; # Private memory for a specific process: cat /proc/90875/smaps|sed \u0026#39;/zero/,/VmFlags/d\u0026#39; |grep Private |awk \u0026#39;{sum+=$2 };END {print sum/1024}\u0026#39; min_free_kbytes:\n(https://vivani.net/2022/06/14/linux-kernel-tuning-page-allocation-failure/)\nWhen free memory is low, the kswapd daemon is woken to free pages:\npages_low: when free pages fall below pages_low, buddy allocator wakes kswapd and the kernel begins swapping pages to disk pages_min: when free pages reach pages_min, reclamation pressure is high — the zone urgently needs free pages. 
The allocator performs synchronous kswapd work, sometimes called direct reclaim pages_high: once kswapd is awake and freeing pages, the kernel considers the zone \u0026ldquo;balanced\u0026rdquo; only when free pages reach pages_high. At pages_high, kswapd goes back to sleep. Free pages above pages_high means the zone is in an ideal state vm.min_free_kbytes (the pages_min watermark) is an extremely important OS parameter. Too low a value prevents effective memory reclamation, potentially causing system crashes and service interruptions. Too high a value increases reclaim activity, causing allocation delays that can immediately trigger OOM.\nOptimization results:\nAfter increasing min_free_kbytes + deploying off-peak drop-cache jobs, problems have decreased significantly.\nWhy increase min_free_kbytes?\nThis is used to force the Linux VM to keep a minimum number of kilobytes free. The VM uses this number to compute a watermark[WMARK_MIN] value for each lowmem zone in the system. Each lowmem zone gets a number of reserved free pages based proportionally on its size.\nSource: kernel.org docs\nThe point of raising min_free_kbytes isn\u0026rsquo;t to raise the min watermark and trigger direct reclaim more often; it\u0026rsquo;s because the low watermark couldn\u0026rsquo;t be tuned on its own in RHEL 7 and earlier. 
The only way to raise low proportionally was to raise min, making asynchronous reclaim trigger earlier and giving direct reclaim a buffer window.\nRed Hat 8 added two memory parameters to improve reclaim; one of them, watermark_scale_factor, can raise the watermarks without touching min_free_kbytes.\nRecommend enabling huge pages:\nHuge pages perform better when PG requests contiguous memory Huge pages also help reduce page table size shared_buffers can use huge pages; requires huge_pages=on and OS-level huge pages enabled Instances with huge pages enabled in production show better performance and fewer problems AWS huge pages standard: enabled by default for all instance classes except certain test tiers, and cannot be disabled Huge_pages parameter is turned on by default for all DB instance classes other than t3.medium, db.t3.large, db.t4g.medium, db.t4g.large instance classes. You can\u0026rsquo;t change the huge_pages parameter value or turn off this feature in the supported instance classes of Aurora PostgreSQL.\ncgroup and Host Memory Mismatch # When cgroup memory hits its limit, kswapd prioritizes reclaiming pages within the cgroup. With cloud VM instance types and cgroup configurations, the host may have free memory above watermarks while the cgroup is under pressure. The host-level pages_low doesn\u0026rsquo;t trigger asynchronous reclaim for either host or cgroup memory. Eventually, direct reclaim fires to satisfy the cgroup\u0026rsquo;s DB memory demand.\nThe root cause: cgroups lack independent free-page memory management.\nThe only fix: increase the cgroup memory limit, overcommitting the host more aggressively so the host reaches pages_low sooner.\nshared_buffer and pagecache # PG uses a double-buffer mechanism; there is no direct IO yet.\nDouble buffer: DB shared_buffers (one layer of shared memory) + OS pagecache (another layer). In real deployments, pagecache is typically far larger than shared_buffers. 
And pagecache counts against cgroup mem but isn\u0026rsquo;t reflected in cgroup memory monitoring\u0026hellip;\nBottom line: leave plenty of memory for pagecache. Don\u0026rsquo;t make shared_buffers excessively large (20GB seems sufficient for most cases). Only increase it if you clearly observe buffer-mapping-related wait events.\nwork_mem Cannot Cap Hash Join / Hash Aggregate Memory # hash_mem_multiplier limits memory for hash-based operations (hash join, hash agg, etc.), capping at hash_mem_multiplier * work_mem. The default is 2.0 in PG15 and later (1.0 in PG13 and PG14).\nBefore PG13, work_mem was tunable, but there was no way to limit how many hash operations a single query could use. PG13 added this multiplier. In other words, pre-13, it was very hard to cap hash-table memory.\nIn a production environment on PG12 or earlier, I found a single session consuming 300GB of memory; the culprit was the lack of hash-table limits combined with a plan that incorrectly used hash tables.\nOther Issues # Exclusive Backup and Startup Issues # Normally, when the database stops and restarts, the startup position comes from pg_controldata\u0026rsquo;s LSN. But if there\u0026rsquo;s a backup_label file in PGDATA, the startup LSN is read from backup_label.\nWhat problems does this cause?\nDisk snapshots taken directly on the data directory may include the label file. If the database is large and the backup took a long time, restart can be very slow Bigger problem: after a production shutdown from certain causes, restart takes forever. The root cause is the startup LSN coming from the backup rather than controldata Version changes:\nPG13:\npg_start_backup() pg_stop_backup()\nSupports exclusive and non-exclusive modes; exclusive is the default. Exclusive mode creates backup_label in the data directory at start and cleans it at stop. 
Non-exclusive mode doesn\u0026rsquo;t create the label at start; it returns the label info at stop.\nPG15:\npg_backup_start() pg_backup_stop()\nFunction names changed, and exclusive backup mode was removed. No backup_label is written at backup start; instead it\u0026rsquo;s written to the backup area at backup stop.\npg_stat_activity Unqueryable # Symptom:\npg_stat_activity hangs and can\u0026rsquo;t be queried.\npstack at the time:\n#0 pgstat_read_current_status () at pgstat.c:3642 #1 0x0000000000727181 in pgstat_read_current_status () at pgstat.c:2788 #2 pgstat_fetch_stat_numbackends () at pgstat.c:2789 #3 0x000000000083f2ee in pg_stat_get_activity (fcinfo=0x25c2d98) at pgstatfuncs.c:575 #4 0x000000000065058f in ExecMakeTableFunctionResult (setexpr=0x25b1d28, econtext=0x25b1c48, argContext=\u0026lt;optimized out\u0026gt;, expectedDesc=0x2545218, randomAccess=false) at execSRF.c:234 #5 0x00000000006609dc in FunctionNext (node=node@entry=0x25b1b38) at nodeFunctionscan.c:94 #6 0x000000000065110c in ExecScanFetch (recheckMtd=0x660700 \u0026lt;FunctionRecheck\u0026gt;, accessMtd=0x660720 \u0026lt;FunctionNext\u0026gt;, node=0x25b1b38) at execScan.c:133 Analysis:\nThe code location is clear — stuck in an infinite loop after st_changecount becomes odd.\nTriggers: OOM (reproducible), abnormal backend exit (possible), terminate (maybe). None of these guarantee the issue, though.\nCommunity thread didn\u0026rsquo;t reach a conclusion. Currently the trigger probability appears low.\nSolution: restart the database.\nConnection and Connection Pooling Issues # IO Error Messages # IO errors typically mean the application is using a connection that\u0026rsquo;s already been closed. This happens often, and diagnosing it is difficult because the entire chain involves many components and broad domain knowledge. Here\u0026rsquo;s a brief summary.\nKnown active-disconnection scenarios:\nhikari maxLifetime Symptom: session lifetime matches the parameter. 
Possible cause: the application holds an explicit transaction with an uncommitted SELECT, the pool closes the session, and the app gets io error; could not rollback or similar.\npg.datasouce.maxLifetime druid timeout Symptom: connection drops after SQL execution exceeds 20s.\nspring.datasource.dynamic.druid.socketTimeout=20000 spring.datasource.dynamic.druid.connectTimeout=20000 Change to: spring.datasource.socketTimeout=3600000 spring.datasource.connectTimeout=3600000 Application Horizontal Scaling vs. Database Connection Limits # Horizontal application scaling meets PG connection bottlenecks:\nHikariCP is now Spring Boot\u0026rsquo;s default connection pool. With the proliferation of Spring Boot and microservices, HikariCP usage is widespread. Every pod scaled out increases database connection count. The maximumPoolSize stays the same per pod, but more nodes mean more total connections. From existing node count, added node count, and current total connections, you can proportionally calculate how many idle connections will be added.\nApplications can scale horizontally without state, but databases cannot. PG\u0026rsquo;s connection limit is max_connections. Unchecked application scaling can saturate idle connections. Tuning max_connections is painful because it requires a database restart.\nPG connection upper limit:\nAlso, even with unlimited horizontal scaling, max_connections should adjust with instance class — but there\u0026rsquo;s a real ceiling. In any database, idle connections degrade performance as they increase.\nRefer to AWS\u0026rsquo;s approach: max_connections is tied to instance class, with a maximum of 5000, LEAST({DBInstanceClassMemory/9531392}, 5000). 
This reduces manual connection ops and provides a reasonable ceiling.\n","date":"Jan 8, 2025","externalUrl":null,"permalink":"/en/2025/01/08/postgresql-ops-experience-2024/","section":"Posts","summary":"This article focuses on common PostgreSQL operations issues — rare edge cases that surface once every two or three years are out of scope.\nIt’s primarily a technical ops summary, aiming for clarity and quick applicability. Deep dives at the source-code level are deliberately avoided.\nSQL Performance \u0026 Execution Plans # Sudden Execution Plan Changes # PostgreSQL does not support optimizer hints natively, and the community has made it clear it never will. The PG community’s stance is roughly: “Our optimizer is perfect. If the current plan isn’t good enough, it’s because the developer doesn’t understand optimization.”\n","title":"PostgreSQL Ops Experience 2024","type":"posts"},{"content":" Walsender Blocking Shutdown Symptoms # Production shutdown log output:\n2024-12-06 17:00:02.036 CST,,,447560,,65693cde.6d448,1320,,2023-12-01 09:54:38 CST,,0,LOG,00000,\u0026#34;received fast shutdown request\u0026#34;,,,,,,,,,\u0026#34;\u0026#34;,\u0026#34;postmaster\u0026#34; 2024-12-06 17:00:02.295 CST,,,447560,,65693cde.6d448,1322,,2023-12-01 09:54:38 CST,,0,LOG,00000,\u0026#34;background worker \u0026#34;\u0026#34;logical replication launcher\u0026#34;\u0026#34; (PID 448996) exited with exit code 1\u0026#34;,,,,,,,,,\u0026#34;\u0026#34;,\u0026#34;postmaster\u0026#34; 2024-12-06 17:00:10.627 CST,,,448990,,65693ce0.6d9de,213833,,2023-12-01 09:54:40 CST,,0,LOG,00000,\u0026#34;checkpoint complete: wrote 426844 buffers (5.1%); 0 WAL file(s) added, 0 removed, 5 recycled; write=91.427 s, sync=0.055 s, total=91.508 s; sync files=761, longest=0.028 s, average=0.001 s; distance=2197531 kB, estimate=2680783 kB\u0026#34;,,,,,,,,,\u0026#34;\u0026#34;,\u0026#34;checkpointer\u0026#34; 2024-12-06 17:00:10.628 CST,,,448990,,65693ce0.6d9de,213834,,2023-12-01 09:54:40 
CST,,0,LOG,00000,\u0026#34;shutting down\u0026#34;,,,,,,,,,\u0026#34;\u0026#34;,\u0026#34;checkpointer\u0026#34; ... --checkpointer finished checkpoint and is in shutting down state, pm has not exited --160s later pm receives immediate shutdown, triggered by health check script 2024-12-06 17:02:43.348 CST,,,447560,,65693cde.6d448,1323,,2023-12-01 09:54:38 CST,,0,LOG,00000,\u0026#34;received immediate shutdown request\u0026#34;,,,,,,,,,\u0026#34;\u0026#34;,\u0026#34;postmaster\u0026#34; 2024-12-06 17:02:43.370 CST,\u0026#34;logicaluser\u0026#34;,\u0026#34;lzldb\u0026#34;,283840,\u0026#34;10.33.77.159:39865\u0026#34;,6751a2dc.454c0,7,\u0026#34;idle\u0026#34;,2024-12-05 20:55:56 CST,89/847309655,0,WARNING,57P02,\u0026#34;terminating connection because of crash of another server process\u0026#34;,\u0026#34;The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.\u0026#34;,\u0026#34;In a moment you should be able to reconnect to the database and repeat your command.\u0026#34;,,,,,,,\u0026#34;Debezium Streaming\u0026#34;,\u0026#34;walsender\u0026#34; 2024-12-06 17:02:43.370 CST,\u0026#34;logicaluser\u0026#34;,\u0026#34;lzldb\u0026#34;,157641,\u0026#34;10.33.77.159:39407\u0026#34;,67408354.267c9,7,\u0026#34;idle\u0026#34;,2024-11-22 21:12:52 CST,9/3193590104,0,WARNING,57P02,\u0026#34;terminating connection because of crash of another server process\u0026#34;,\u0026#34;The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.\u0026#34;,\u0026#34;In a moment you should be able to reconnect to the database and repeat your command.\u0026#34;,,,,,,,\u0026#34;Debezium Streaming\u0026#34;,\u0026#34;walsender\u0026#34; 2024-12-06 17:02:43.370 
CST,\u0026#34;logicaluser\u0026#34;,\u0026#34;lzldb\u0026#34;,157916,\u0026#34;10.33.77.159:57038\u0026#34;,67408356.268dc,7,\u0026#34;idle\u0026#34;,2024-11-22 21:12:54 CST,115/3293293502,0,WARNING,57P02,\u0026#34;terminating connection because of crash of another server process\u0026#34;,\u0026#34;The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.\u0026#34;,\u0026#34;In a moment you should be able to reconnect to the database and repeat your command.\u0026#34;,,,,,,,\u0026#34;Debezium Streaming\u0026#34;,\u0026#34;walsender\u0026#34; 2024-12-06 17:02:43.370 CST,\u0026#34;repuser\u0026#34;,\u0026#34;\u0026#34;,164392,\u0026#34;30.151.40.19:41641\u0026#34;,66b25869.28228,3,\u0026#34;streaming 42D3B/1732C5F0\u0026#34;,2024-08-07 01:07:53 CST,296/0,0,WARNING,57P02,\u0026#34;terminating connection because of crash of another server process\u0026#34;,\u0026#34;The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.\u0026#34;,\u0026#34;In a moment you should be able to reconnect to the database and repeat your command.\u0026#34;,,,,,,,\u0026#34;standby_6666\u0026#34;,\u0026#34;walsender\u0026#34; 2024-12-06 17:02:43.371 CST,,,447560,,65693cde.6d448,1324,,2023-12-01 09:54:38 CST,,0,LOG,00000,\u0026#34;archiver process (PID 448994) exited with exit code 2\u0026#34;,,,,,,,,,\u0026#34;\u0026#34;,\u0026#34;postmaster\u0026#34; 2024-12-06 17:02:43.371 CST,\u0026#34;logicaluser\u0026#34;,\u0026#34;lzldb\u0026#34;,57755,\u0026#34;10.33.77.159:38918\u0026#34;,67125534.e19b,7,\u0026#34;idle\u0026#34;,2024-10-18 20:31:48 CST,243/902018192,0,WARNING,57P02,\u0026#34;terminating connection because of crash of another server process\u0026#34;,\u0026#34;The postmaster has commanded this server process to roll back the current 
transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.\u0026#34;,\u0026#34;In a moment you should be able to reconnect to the database and repeat your command.\u0026#34;,,,,,,,\u0026#34;Debezium Streaming\u0026#34;,\u0026#34;walsender\u0026#34; 2024-12-06 17:02:43.372 CST,\u0026#34;logicaluser\u0026#34;,\u0026#34;lzldb\u0026#34;,157915,\u0026#34;10.33.77.159:43433\u0026#34;,67408356.268db,7,\u0026#34;idle\u0026#34;,2024-11-22 21:12:54 CST,60/3248014863,0,WARNING,57P02,\u0026#34;terminating connection because of crash of another server process\u0026#34;,\u0026#34;The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.\u0026#34;,\u0026#34;In a moment you should be able to reconnect to the database and repeat your command.\u0026#34;,,,,,,,\u0026#34;Debezium Streaming\u0026#34;,\u0026#34;walsender\u0026#34; --pm finished shutting down 2024-12-06 17:02:57.534 CST,,,447560,,65693cde.6d448,1325,,2023-12-01 09:54:38 CST,,0,LOG,00000,\u0026#34;database system is shut down\u0026#34;,,,,,,,,,\u0026#34;\u0026#34;,\u0026#34;postmaster\u0026#34; 2024-12-06 17:03:49.536 CST,,,211844,,6752bdf3.33b84,1,,2024-12-06 17:03:47 CST,,0,LOG,00000,\u0026#34;ending log output to stderr\u0026#34;,,\u0026#34;Future log output will go to log destination \u0026#34;\u0026#34;csvlog\u0026#34;\u0026#34;.\u0026#34;,,,,,,,\u0026#34;\u0026#34;,\u0026#34;postmaster\u0026#34; 17:00:02 postmaster receives fast shutdown\n17:00:10 checkpoint completed, checkpointer stopped\n17:02:43 postmaster receives immediate shutdown\n17:02:43 1 physical and 5 logical replication walsenders stopped\n17:02:57 postmaster stopped\n17:03:49 postmaster receives startup task\nFrom the above, it\u0026rsquo;s clear that walsender was blocking the shutdown.\nShutdown and Signals # Before diving into source code, we need to understand signals and 
signal registration in PG.\nCommon Signals in PG # OS signals:\n$ kill -l 1) SIGHUP 2) SIGINT 3) SIGQUIT 4) SIGILL 5) SIGTRAP 6) SIGABRT 7) SIGBUS 8) SIGFPE 9) SIGKILL 10) SIGUSR1 11) SIGSEGV 12) SIGUSR2 13) SIGPIPE 14) SIGALRM 15) SIGTERM 16) SIGSTKFLT 17) SIGCHLD 18) SIGCONT 19) SIGSTOP 20) SIGTSTP ... Common signals used in PG:\n-1 or -SIGHUP: Hangup signal. In PG, typically tells the process to reload configuration. -2 or -SIGINT: Interrupt signal (usually Ctrl+C). In PG, usually corresponds to cancel command. -3 or -SIGQUIT: In PG, usually means forced exit (die). -9 or -SIGKILL: Unconditional termination signal. -15 or -SIGTERM: Termination signal, the signal used by pg_terminate_backend. In PG, usually means graceful exit. -10 or -SIGUSR1: Custom signal. -12 or -SIGUSR2: Custom signal. -17 or SIGCHLD: Signal used by the pm process. When a child process exits, pm receives this signal to trigger child process reaping. The specific meaning of signals registered by each type of PG process can be found by reading the respective process source code.\nShutdown Defined by pg_ctl # There are several ways to shut down a PG database. At the bottom level, they all boil down to sending a signal to the postmaster process.\nsignal pg_ctl Meaning SIGTERM Smart Shutdown Disallow new connections, but allow existing sessions to finish their work normally. Only shuts down after all sessions terminate. SIGINT Fast Shutdown Server disallows new connections and sends SIGTERM to all existing child processes, aborting current transactions and exiting quickly. Waits for almost all child processes (some are not needed) to exit, then shuts down. SIGQUIT Immediate Shutdown Sends SIGQUIT to all child processes and waits for them to terminate. If any child process has not terminated within 5 seconds, they are sent SIGKILL. Note: pg_ctl has no parameter for sending SIGKILL (kill -9), but you can send SIGKILL directly to pm — though it\u0026rsquo;s definitely not recommended. 
When sending SIGKILL to pm, pm won\u0026rsquo;t do any cleanup of child processes, shared memory, or semaphores. Since SIGQUIT to pm has fallback logic for SIGKILL-ing child processes, SIGQUIT to pm basically guarantees pm will stop.\nIn the source code, there are only 3 shutdown states, corresponding to shutdown modes:\n/* Startup/shutdown state */ #define\tNoShutdown\t0 #define\tSmartShutdown\t1 #define\tFastShutdown\t2 #define\tImmediateShutdown\t3 These states appear frequently in shutdown routine source code, generally checked via the Shutdown variable:\nShutdown \u0026gt;= FastShutdown pm Signals # When pm receives the corresponding signal, it handles it accordingly:\nvoid PostmasterMain(int argc, char *argv[]) {... pqsignal_pm(SIGHUP, SIGHUP_handler);\t/* reread config file and have * children do same */ pqsignal_pm(SIGINT, pmdie); /* send SIGTERM and shut down */ pqsignal_pm(SIGQUIT, pmdie);\t/* send SIGQUIT and die */ pqsignal_pm(SIGTERM, pmdie);\t/* wait for children and shut down */ pqsignal_pm(SIGALRM, SIG_IGN);\t/* ignored */ pqsignal_pm(SIGPIPE, SIG_IGN);\t/* ignored */ pqsignal_pm(SIGUSR1, sigusr1_handler);\t/* message from child process */ pqsignal_pm(SIGUSR2, dummy_handler);\t/* unused, reserve for children */ pqsignal_pm(SIGCHLD, reaper);\t/* handle child termination */ pmdie: The three shutdown signals call the pmdie function. pmdie is the key shutdown function, analyzed in detail below. reaper: During shutdown, handles child process exit cleanup. When a child process exits, it sends SIGCHLD to pm, which enters reaper to clean up the child. Each child process cleanup has its own logic — for instance, normal exit of the checkpointer process checks whether archiver and walsender have completed their respective tasks. sigusr1, sigusr2: sigusr1_handler is the standard routine for SIGUSR1. Each child process handles SIGUSR1 differently. SIGUSR2 is entirely custom per child process; some child processes don\u0026rsquo;t even register this signal. 
Walsender Signals # When a child process is forked, it first registers signals.\nWalSndSignals registers signals for the walsender process:\n/* Set up signal handlers */ void WalSndSignals(void) { /* Set up signal handlers */ pqsignal(SIGHUP, SignalHandlerForConfigReload); pqsignal(SIGINT, StatementCancelHandler);\t/* query cancel */ pqsignal(SIGTERM, die);\t/* request shutdown */ pqsignal(SIGQUIT, quickdie);\t/* hard crash time */ InitializeTimeouts();\t/* establishes SIGALRM handler */ pqsignal(SIGPIPE, SIG_IGN); pqsignal(SIGUSR1, procsignal_sigusr1_handler); pqsignal(SIGUSR2, WalSndLastCycleHandler);\t/* request a last cycle and * shutdown */ } Note SIGUSR1 and SIGUSR2.\nCheckpointer Signals # CheckpointerMain registers checkpointer signals:\nvoid CheckpointerMain(void) { ... //checkpointer ignores SIGTERM, the actual stop signal is SIGUSR2 pqsignal(SIGHUP, SignalHandlerForConfigReload); pqsignal(SIGINT, ReqCheckpointHandler); /* request checkpoint */ pqsignal(SIGTERM, SIG_IGN); /* ignore SIGTERM */ pqsignal(SIGQUIT, SignalHandlerForCrashExit); pqsignal(SIGALRM, SIG_IGN); pqsignal(SIGPIPE, SIG_IGN); pqsignal(SIGUSR1, procsignal_sigusr1_handler); pqsignal(SIGUSR2, SignalHandlerForShutdownRequest); Note SIGUSR1 and SIGUSR2, and also note that checkpointer registers SIGTERM as SIG_IGN, i.e. it ignores SIGTERM.\nShutdown Source Code Analysis # pm Signal Handling and State Machine # The pmdie function handles the three shutdown signals (SIGTERM, SIGINT, SIGQUIT) that pg_ctl sends to pm; SIGCHLD sent to pm when a child process exits is handled separately by reaper. The main logic of pm signal handling is converting the signal into a pmState state machine state transition, then entering PostmasterStateMachine for processing.\npmdie:\n/* * pmdie -- signal handler for processing various postmaster signals. */ static void pmdie(SIGNAL_ARGS) { int\tsave_errno = errno; ... switch (postgres_signal_arg) { case SIGTERM://Smart Shutdown ... if (pmState == PM_RUN) connsAllowed = ALLOW_SUPERUSER_CONNS; ... 
//smart shutdown does not process pmstate, hands directly to state machine //at this point normal pmState = PM_RUN PostmasterStateMachine(); break; case SIGINT://Fast Shutdown ... else if (pmState == PM_RUN || pmState == PM_HOT_STANDBY) { /* Report that we\u0026#39;re about to zap live client sessions */ ereport(LOG, (errmsg(\u0026#34;aborting any active transactions\u0026#34;))); pmState = PM_STOP_BACKENDS; } //Fast Shutdown transitions pmstate to PM_STOP_BACKENDS //then hands to state machine PostmasterStateMachine(); break; case SIGQUIT://Immediate Shutdown ... TerminateChildren(SIGQUIT);//abort all children with SIGQUIT, wait for them to exit pmState = PM_WAIT_BACKENDS; /* set stopwatch for them to die */ AbortStartTime = time(NULL); //Immediate Shutdown transitions pmstate to PM_WAIT_BACKENDS //process children before entering state machine //first interrupt children with SIGQUIT, wait for them to exit //then use SIGKILL on remaining children //finally non-consistent exit PostmasterStateMachine(); break; } ... 
} Before entering the state machine handler, let\u0026rsquo;s look at the postmaster states:\ntypedef enum { PM_INIT,\t/* postmaster starting */ PM_STARTUP,\t/* waiting for startup subprocess */ PM_RECOVERY,\t/* in archive recovery mode */ PM_HOT_STANDBY,\t/* in hot standby mode */ PM_RUN,\t/* normal \u0026#34;database is alive\u0026#34; state */ PM_STOP_BACKENDS,\t/* need to stop remaining backends */ PM_WAIT_BACKENDS,\t/* waiting for live backends to exit */ PM_SHUTDOWN,\t/* waiting for checkpointer to do shutdown * ckpt */ PM_SHUTDOWN_2,\t/* waiting for archiver and walsenders to * finish */ PM_WAIT_DEAD_END,\t/* waiting for dead_end children to exit */ PM_NO_CHILDREN\t/* all important children have exited */ } PMState; Since shutdown normally happens from the running state, we only need to focus on states at PM_RUN and below.\nPostmasterStateMachine execution has a sequential logic:\n/* * Advance the postmaster\u0026#39;s state machine and take actions as appropriate * * This is common code for pmdie(), reaper() and sigusr1_handler(), which * receive the signals that might mean we need to change state. */ static void PostmasterStateMachine(void) { //smart shutdown, pmState should be PM_RUN at this point if (pmState == PM_RUN || pmState == PM_HOT_STANDBY) { ... if (connsAllowed == ALLOW_NO_CONNS) { //After all normal backends exit, transition pmState to PM_STOP_BACKENDS if (CountChildren(BACKEND_TYPE_NORMAL) == 0) pmState = PM_STOP_BACKENDS; } } //PM_STOP_BACKENDS stops some core child processes, some will continue running //autovacuum, bgwriter, walwriter, startup, walreceiver will stop //walsender, checkpointer, archiver, stats, and syslogger will keep running //smart shutdown later phase enters this logic, fast shutdown enters directly if (pmState == PM_STOP_BACKENDS) { ...\t//Note this line about walsender! 
/* Signal all backend children except walsenders */ SignalSomeChildren(SIGTERM, BACKEND_TYPE_ALL - BACKEND_TYPE_WALSND); /* and the autovac launcher too */ if (AutoVacPID != 0) signal_child(AutoVacPID, SIGTERM); /* and the bgwriter too */ if (BgWriterPID != 0) signal_child(BgWriterPID, SIGTERM); /* and the walwriter too */ if (WalWriterPID != 0) signal_child(WalWriterPID, SIGTERM); /* If we\u0026#39;re in recovery, also stop startup and walreceiver procs */ if (StartupPID != 0) signal_child(StartupPID, SIGTERM); if (WalReceiverPID != 0) signal_child(WalReceiverPID, SIGTERM); /* checkpointer, archiver, stats, and syslogger may continue for now */ //Transition pmState from PM_STOP_BACKENDS to PM_WAIT_BACKEND //PM_WAIT_BACKEND means waiting for backends to exit pmState = PM_WAIT_BACKENDS; } /* * If we are in a state-machine state that implies waiting for backends to * exit, see if they\u0026#39;re all gone, and change state if so. */ // //smart shutdown, fast shutdown later phase enters this logic //immediate shutdown when entering state machine, directly enters this logic if (pmState == PM_WAIT_BACKENDS) { //During crash recovery and immediate shutdown, checkpointer needs proper exit //archiver, stats, and syslogger don\u0026#39;t need handling since they don\u0026#39;t touch shared memory //Walsenders also don\u0026#39;t need handling; they exit after checkpoint record is written, just like archiver if (CountChildren(BACKEND_TYPE_ALL - BACKEND_TYPE_WALSND) == 0 \u0026amp;\u0026amp; StartupPID == 0 \u0026amp;\u0026amp; WalReceiverPID == 0 \u0026amp;\u0026amp; BgWriterPID == 0 \u0026amp;\u0026amp; (CheckpointerPID == 0 || (!FatalError \u0026amp;\u0026amp; Shutdown \u0026lt; ImmediateShutdown)) \u0026amp;\u0026amp; WalWriterPID == 0 \u0026amp;\u0026amp; AutoVacPID == 0) { if (Shutdown \u0026gt;= ImmediateShutdown || FatalError) { //ImmediateShutdown waits for dead end processes to finish pmState = PM_WAIT_DEAD_END; /* * We already SIGQUIT\u0026#39;d the archiver and 
stats processes, if * any, when we started immediate shutdown or entered * FatalError state. */ } else { //smart, fast shutdown goes here //regular child processes have all exited, now notify checkpointer to do shutdown checkpoint Assert(Shutdown \u0026gt; NoShutdown); //If checkpointer process doesn\u0026#39;t exist, start one if (CheckpointerPID == 0) CheckpointerPID = StartCheckpointer(); /* And tell it to shut down */ if (CheckpointerPID != 0) { //Send SIGUSR2 to Checkpointer //pmState = PM_SHUTDOWN signal_child(CheckpointerPID, SIGUSR2); pmState = PM_SHUTDOWN; } else { //Failing to start Checkpointer is a serious problem FatalError = true; pmState = PM_WAIT_DEAD_END; /* Kill the walsenders, archiver and stats collector too */ //Comment says kill walsender, but it actually doesn\u0026#39;t; at least not via SIGQUIT SignalChildren(SIGQUIT); if (PgArchPID != 0) signal_child(PgArchPID, SIGQUIT); if (PgStatPID != 0) signal_child(PgStatPID, SIGQUIT); } } } } //The pmdie function and state machine function won\u0026#39;t create PM_SHUTDOWN_2 state, but reaper will //When reaper handles checkpointer exit, it sets pmState = PM_SHUTDOWN_2; at the end of reaper, it enters the state machine function, which is here if (pmState == PM_SHUTDOWN_2) { /* * PM_SHUTDOWN_2 state ends when there\u0026#39;s no other children than * dead_end children left. There shouldn\u0026#39;t be any regular backends * left by now anyway; what we\u0026#39;re really waiting for is walsenders and * archiver. 
*/ //PM_SHUTDOWN_2 essentially waits for walsender and archiver //only changes pmState if (PgArchPID == 0 \u0026amp;\u0026amp; CountChildren(BACKEND_TYPE_ALL) == 0) { pmState = PM_WAIT_DEAD_END; } } if (pmState == PM_WAIT_DEAD_END) { //PM_WAIT_DEAD_END means BackendList is completely empty if (dlist_is_empty(\u0026amp;BackendList) \u0026amp;\u0026amp; PgArchPID == 0 \u0026amp;\u0026amp; PgStatPID == 0) { /* These other guys should be dead already */ Assert(StartupPID == 0); Assert(WalReceiverPID == 0); Assert(BgWriterPID == 0); Assert(CheckpointerPID == 0); Assert(WalWriterPID == 0); Assert(AutoVacPID == 0); /* syslogger is not considered here */ pmState = PM_NO_CHILDREN; } } //PM_NO_CHILDREN is the last shutdown state, meaning normal shutdown can proceed if (Shutdown \u0026gt; NoShutdown \u0026amp;\u0026amp; pmState == PM_NO_CHILDREN) { if (FatalError) { ereport(LOG, (errmsg(\u0026#34;abnormal database system shutdown\u0026#34;))); //Abnormal pm exit ExitPostmaster(1); } ... //Normal pm exit ExitPostmaster(0); } } ... } reaper is the process reaping function. When a child process exits, it sends SIGCHLD to pm, and pm cleans up the process via the reaper function. Each process type — backend, startup, checkpointer, etc. — has its own cleanup flow.\nHere we only look at checkpointer cleanup. Also, reaper has no cleanup logic for walsender:\nif (pid == CheckpointerPID) { CheckpointerPID = 0; //Checkpointer exited normally, and pmState is PM_SHUTDOWN: waiting for checkpoint completion if (EXIT_STATUS_0(exitstatus) \u0026amp;\u0026amp; pmState == PM_SHUTDOWN) { /* * OK, we saw normal exit of the checkpointer after it\u0026#39;s been * told to shut down. We expect that it wrote a shutdown * checkpoint. (If for some reason it didn\u0026#39;t, recovery will * occur on next postmaster start.) * * At this point we should have no normal backend children * left (else we\u0026#39;d not be in PM_SHUTDOWN state) but we might * have dead_end children to wait for. 
* * If we have an archiver subprocess, tell it to do a last * archive cycle and quit. Likewise, if we have walsender * processes, tell them to send any remaining WAL and quit. */ Assert(Shutdown \u0026gt; NoShutdown); //Wake archiver for the last time if (PgArchPID != 0) signal_child(PgArchPID, SIGUSR2); //pgarch SIGUSR2=pgarch_waken_stop //Wake walsender for the last time SignalChildren(SIGUSR2);//walsender SIGUSR2=WalSndLastCycleHandler //Here PM_SHUTDOWN_2 is set //At this point Checkpointer has exited normally; we should wait for pgarch and walsender to finish their last task //This is PM_SHUTDOWN_2 state pmState = PM_SHUTDOWN_2; ... } else { //checkpointer abnormal exit is considered a crash HandleChildCrash(pid, exitstatus, _(\u0026#34;checkpointer process\u0026#34;)); } continue; } ... //At the end reaper still enters the state machine function PostmasterStateMachine(); ... } Checkpointer and Walsender Process Exit # Checkpointer main loop handling requests and shutdown:\nvoid CheckpointerMain(void) { /* * Loop forever */ for (;;) { bool\tdo_checkpoint = false; int\tflags = 0; pg_time_t\tnow; int\telapsed_secs; int\tcur_timeout; /* Clear any already-pending wakeups */ ResetLatch(MyLatch); /* * Process any requests or signals received recently. */ //Process recent sync requests and signals AbsorbSyncRequests(); HandleCheckpointerInterrupts(); Checkpointer shutdown function:\n/* * Process any new interrupts. */ static void HandleCheckpointerInterrupts(void) { ... if (ShutdownRequestPending) { /* * From here on, elog(ERROR) should end with exit(1), not send control * back to the sigsetjmp block above */ ExitOnAnyError = true; ShutdownXLOG(0, 0);//This writes the shutdown checkpoint proc_exit(0);//Normal exit code 0 } } Checkpointer exit needs to wait for ShutdownXLOG to complete.\nShutdownXLOG:\n/* * This must be called ONCE during postmaster or standalone-backend shutdown */ void ShutdownXLOG(int code, Datum arg) { ... 
//Here\u0026#39;s the checkpointer \u0026#34;shutting down\u0026#34; log, usually always seen ereport(IsPostmasterEnvironment ? LOG : NOTICE, (errmsg(\u0026#34;shutting down\u0026#34;))); /* * Signal walsenders to move to stopping state. */ //Initialize walsender stopping WalSndInitStopping(); //Wait for all walsenders to be in stopping state WalSndWaitStopping(); if (RecoveryInProgress()) CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE); else { /* * If archiving is enabled, rotate the last XLOG file so that all the * remaining records are archived (postmaster wakes up the archiver * process one more time at the end of shutdown). The checkpoint * record will go to the next XLOG file and won\u0026#39;t be archived (yet). */ if (XLogArchivingActive() \u0026amp;\u0026amp; XLogArchiveCommandSet()) RequestXLogSwitch(false); //This is the shutdown checkpoint creation function CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE); } ShutdownCLOG(); ShutdownCommitTs(); ShutdownSUBTRANS(); ShutdownMultiXact(); } Checkpointer notifies all walsenders to begin stopping:\n/* * Signal all walsenders to move to stopping state. * * This will trigger walsenders to move to a state where no further WAL can be * generated. See this file\u0026#39;s header for details. */ void WalSndInitStopping(void) { int\ti; for (i = 0; i \u0026lt; max_wal_senders; i++) { WalSnd\t*walsnd = \u0026amp;WalSndCtl-\u0026gt;walsnds[i]; pid_t\tpid; SpinLockAcquire(\u0026amp;walsnd-\u0026gt;mutex); pid = walsnd-\u0026gt;pid; SpinLockRelease(\u0026amp;walsnd-\u0026gt;mutex); if (pid == 0) continue; SendProcSignal(pid, PROCSIG_WALSND_INIT_STOPPING, InvalidBackendId); } } Walsender receives the signal via the SendProcSignal function, with signal SIGUSR1:\n/* * SendProcSignal *\tSend a signal to a Postgres process * * Providing backendId is optional, but it will speed up the operation. * * On success (a signal was sent), zero is returned. 
* On error, -1 is returned, and errno is set (typically to ESRCH or EPERM). * * Not to be confused with ProcSendSignal */ int SendProcSignal(pid_t pid, ProcSignalReason reason, BackendId backendId) { else { /* * BackendId not provided, so search the array using pid. We search * the array back to front so as to reduce search overhead. Passing * InvalidBackendId means that the target is most likely an auxiliary * process, which will have a slot near the end of the array. */ int\ti; for (i = NumProcSignalSlots - 1; i \u0026gt;= 0; i--) { slot = \u0026amp;ProcSignal-\u0026gt;psh_slot[i]; if (slot-\u0026gt;pss_pid == pid) { /* the above note about race conditions applies here too */ /* Atomically set the proper flag */ slot-\u0026gt;pss_signalFlags[reason] = true; /* Send signal */ return kill(pid, SIGUSR1); } } } errno = ESRCH; return -1; } Walsender\u0026rsquo;s SIGUSR1 registration:\npqsignal(SIGUSR1, procsignal_sigusr1_handler); pqsignal(SIGUSR2, WalSndLastCycleHandler);\t/* request a last cycle and * shutdown */ sigusr1 classifies handling by signal reason:\n/* * procsignal_sigusr1_handler - handle SIGUSR1 signal. */ void procsignal_sigusr1_handler(SIGNAL_ARGS) { ... if (CheckProcSignal(PROCSIG_WALSND_INIT_STOPPING)) HandleWalSndInitStopping(); ... } The handler for PROCSIG_WALSND_INIT_STOPPING is HandleWalSndInitStopping:\n/* * Handle PROCSIG_WALSND_INIT_STOPPING signal. */ void HandleWalSndInitStopping(void) { Assert(am_walsender); /* * If replication has not yet started, die like with SIGTERM. If * replication is active, only set a flag and wake up the main loop. It * will send any outstanding WAL, wait for it to be replicated to the * standby, and then exit gracefully. */ if (!replication_active) kill(MyProcPid, SIGTERM); else got_STOPPING = true;//If walsender is active, initstopping just sets a flag for the main loop to handle } The \u0026ldquo;main loop\u0026rdquo; mentioned in the comment is somewhat ambiguous. 
Walsender\u0026rsquo;s top-level main loop is WalSndLoop (ServerLoop is pm\u0026rsquo;s main loop), but in the logical replication path analyzed here, the check for got_STOPPING sits in the loop inside WalSndWaitForWal.\nThe WalSndWaitForWal function is the loop in which walsender waits for new WAL records. Since WAL records are initially generated in memory, walwriter flushes them based on certain conditions, not all the time. WalSndWaitForWal compares the currently sent LSN with the flushed LSN to determine whether new WAL needs to be sent. In other words, unflushed WAL is not transmitted; only flushed WAL is passed downstream.\nWalSndWaitForWal code segment about stopping:\n/* * Wait till WAL \u0026lt; loc is flushed to disk so it can be safely sent to client. * * Returns end LSN of flushed WAL. Normally this will be \u0026gt;= loc, but * if we detect a shutdown request (either from postmaster or client) * we will return early, so caller must always check. */ static XLogRecPtr WalSndWaitForWal(XLogRecPtr loc) { ... for (;;) { ... //After receiving got_STOPPING, do one flush of WAL //This is necessary! Because walwriter may have already shut down at this point, WAL may not be flushed yet if (got_STOPPING) XLogBackgroundFlush(); /* Update our idea of the currently flushed position. */ if (!RecoveryInProgress()) RecentFlushPtr = GetFlushRecPtr(); else RecentFlushPtr = GetXLogReplayRecPtr(NULL); //Break out of the for loop //After getting new RecentFlushPtr, still need to send if (got_STOPPING) break; ... } /* reactivate latch so WalSndLoop knows to continue */ SetLatch(MyLatch); return RecentFlushPtr; } Back to walsender main loop: WalSndLoop(XLogSendLogical):\n/* Main loop of walsender process that streams the WAL over Copy messages. */ static void WalSndLoop(WalSndSendDataCallback send_data) { ... for (;;) { /* Clear any already-pending wakeups */ ResetLatch(MyLatch); ... 
//Process replies from downstream ProcessRepliesIfAny(); /* * If we have received CopyDone from the client, sent CopyDone * ourselves, and the output buffer is empty, it\u0026#39;s time to exit * streaming. */ //Exit loop when streaming is done if (streamingDoneReceiving \u0026amp;\u0026amp; streamingDoneSending \u0026amp;\u0026amp; !pq_is_send_pending()) break; //If the output buffer has no pending data, call send_data() to load more WAL; otherwise we are not caught up yet if (!pq_is_send_pending()) send_data(); else WalSndCaughtUp = false; /* Try to flush pending output to the client */ if (pq_flush_if_writable() != 0) WalSndShutdown();//Downstream not writable, downstream closed, normal walsender shutdown, exit code 0 /* If nothing remains to be sent right now ... */ if (WalSndCaughtUp \u0026amp;\u0026amp; !pq_is_send_pending()) { /* * If we\u0026#39;re in catchup state, move to streaming. This is an * important state change for users to know about, since before * this point data loss might occur if the primary dies and we * need to failover to the standby. The state change is also * important for synchronous replication, since commits that * started to wait at that point might wait for some time. */ //Caught up with nothing left to send: switch from catchup state to streaming state if (MyWalSnd-\u0026gt;state == WALSNDSTATE_CATCHUP) { ereport(DEBUG1, (errmsg(\u0026#34;\\\u0026#34;%s\\\u0026#34; has now caught up with upstream server\u0026#34;, application_name))); WalSndSetState(WALSNDSTATE_STREAMING); } //Received SIGUSR2, meaning shutdown checkpoint is done. //Send the shutdown checkpoint record, wait for completion, then exit if (got_SIGUSR2) WalSndDone(send_data);//exit code 0 } ... } } Let\u0026rsquo;s return to checkpointer\u0026rsquo;s ShutdownXLOG logic. The above only analyzed WalSndInitStopping(). 
After this signal is sent to walsender, WalSndWaitStopping executes to wait for walsender.\nAs long as any walsender has neither exited nor reached the stopping state, this loop keeps polling every 10 ms and never returns:\n/* * Wait that all the WAL senders have quit or reached the stopping state. This * is used by the checkpointer to control when the shutdown checkpoint can * safely be performed. */ void WalSndWaitStopping(void) { for (;;) { int\ti; bool\tall_stopped = true; for (i = 0; i \u0026lt; max_wal_senders; i++) { WalSnd\t*walsnd = \u0026amp;WalSndCtl-\u0026gt;walsnds[i]; SpinLockAcquire(\u0026amp;walsnd-\u0026gt;mutex); if (walsnd-\u0026gt;pid == 0) { SpinLockRelease(\u0026amp;walsnd-\u0026gt;mutex); continue; } if (walsnd-\u0026gt;state != WALSNDSTATE_STOPPING) { all_stopped = false; SpinLockRelease(\u0026amp;walsnd-\u0026gt;mutex); break; } SpinLockRelease(\u0026amp;walsnd-\u0026gt;mutex); } /* safe to leave if confirmation is done for all WAL senders */ if (all_stopped) return; pg_usleep(10000L);\t/* wait for 10 msec */ } } Finally, the header comment in walsender.c summarizes the sequence:\n* If the server is shut down, checkpointer sends us * PROCSIG_WALSND_INIT_STOPPING after all regular backends have exited. If * the backend is idle or runs an SQL query this causes the backend to * shutdown, if logical replication is in progress all existing WAL records * are processed followed by a shutdown. Otherwise this causes the walsender * to switch to the \u0026#34;stopping\u0026#34; state. In this state, the walsender will reject * any further replication commands. The checkpointer begins the shutdown * checkpoint once all walsenders are confirmed as stopping. When the shutdown * checkpoint finishes, the postmaster sends us SIGUSR2. This instructs * walsender to send any outstanding WAL, including the shutdown checkpoint * record, wait for it to be replicated to the standby, and then exit. 
After all regular backends have exited, checkpointer sends PROCSIG_WALSND_INIT_STOPPING to walsenders. Walsender may enter the stopping state. Only after all walsenders enter stopping state does checkpointer perform the shutdown checkpoint. After the shutdown checkpoint completes, pm sends SIGUSR2 to walsender, which sends any remaining WAL including the shutdown checkpoint record itself, waits for it to be replicated to the standby, then exits.\nShutdown Flow Diagram # After going through the source code, it felt like I understood but also didn\u0026rsquo;t, so I needed a shutdown flowchart to clarify.\nSummary of the fast shutdown flow:\n(High resolution: https://www.processon.com/view/link/6778a73a04a8344b9502637a)\nPG manages shutdown logic through signals, per-process main loops, the PM state machine, and the reaper process reaping function. Also note: signals themselves are asynchronous. If you need to wait for the result of signal processing in a target process, you typically need other synchronization mechanisms (pipes, semaphores, shared memory, etc.). PG mainly relies on process dependencies and whether processes exit normally to determine if signals were properly handled. pgarch and walsender are treated as the same type of process, handled differently from others (walwriter, bgwriter). pgarch and walsender need to do an additional \u0026ldquo;last task\u0026rdquo;. The signal for the \u0026ldquo;last task\u0026rdquo; is typically defined as SIGUSR2. Checkpointer\u0026rsquo;s normal exit depends on every walsender reaching the stopping state (or exiting), and pm\u0026rsquo;s exit in turn depends on pgarch and walsender exiting normally. pgarch\u0026rsquo;s last task is the final archive. So archiving can affect shutdown. Walsender\u0026rsquo;s second-to-last task is delivering the final WAL, and its last task is delivering the checkpoint shutdown info. These tasks require downstream reply messages, so walsender can affect shutdown. 
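The dependency chain summarized above can be sketched as a toy model. This is only an illustration under simplifying assumptions: walsenders are plain dicts, the state names are strings, and wait_stopping takes a max_polls bound that the real WalSndWaitStopping does not have (the real loop polls forever). The function names wait_stopping and fast_shutdown are hypothetical, and the later wait in PM_SHUTDOWN_2 for walsender and archiver to exit is not modeled.

```python
def wait_stopping(walsenders, max_polls):
    # Same condition as WalSndWaitStopping: every live walsender (pid != 0)
    # must be in the stopping state. max_polls exists only so that this
    # sketch terminates; the real loop has no such bound.
    for _ in range(max_polls):
        if all(ws['pid'] == 0 or ws['state'] == 'stopping' for ws in walsenders):
            return True
    return False


def fast_shutdown(walsenders, max_polls=3):
    # pm states visited during a fast shutdown, up to the shutdown checkpoint.
    steps = ['PM_STOP_BACKENDS', 'PM_WAIT_BACKENDS', 'PM_SHUTDOWN']
    if not wait_stopping(walsenders, max_polls):
        steps.append('checkpointer blocked in WalSndWaitStopping')
        return steps
    steps += ['shutdown checkpoint', 'PM_SHUTDOWN_2',
              'PM_WAIT_DEAD_END', 'PM_NO_CHILDREN']
    return steps


healthy = [{'pid': 101, 'state': 'stopping'}, {'pid': 0, 'state': 'streaming'}]
stuck = [{'pid': 118563, 'state': 'streaming'}]
print(fast_shutdown(healthy)[-1])
print(fast_shutdown(stuck)[-1])
```

A single walsender that never reaches the stopping state is enough to keep the model in PM_SHUTDOWN, which is the situation the tests below reproduce on a real instance.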
Test Reproduction # Test: Reproducing Walsender Blocking Shutdown # With a fast shutdown (pg_ctl stop -m fast), walsender can block the shutdown.\nTested various scenarios to reproduce walsender blocking shutdown. Currently, the following conditions together make it easier to trigger abnormal shutdown:\nOne walsender for publication/subscription; one walsender for DTS; a large number of subtransactions causing replication slot spill. This three-in-one scenario doesn\u0026rsquo;t represent the only scenario; it\u0026rsquo;s just one that was easier to reproduce after testing many.\n--Reproduction commands (not extremely stable reproduction) 1.Create table --pg create table lzlpg(id bigserial primary key,a char(2000),b char(2000),c char(2000)); --oracle create table lzl.lzloracle(id number primary key ,a char(2000),b char(2000),c char(2000)) tablespace FADATA; 2.Set up 2 logical replication links (1 pub/sub, 1 DTS to oracle) 3.Reduce logical_decoding_work_mem logical_decoding_work_mem=1MB 4.Write large amounts of data (recommended: subtransaction spill) --Insert one row at a time, each insert as a subtransaction echo \u0026#34;begin;\u0026#34;\u0026gt;subtx.sql for i in {1..500000} do echo \u0026#34;savepoint p$i;\u0026#34;\u0026gt;\u0026gt;subtx.sql echo \u0026#34;insert into lzlpg(a,b,c) select \u0026#39;a\u0026#39;,\u0026#39;b\u0026#39;,\u0026#39;c\u0026#39;;\u0026#34;\u0026gt;\u0026gt;subtx.sql done nohup psql -d lzl -f subtx.sql \u0026amp; 5.Stop the database before writing completes pg_ctl stop -D $PGDATA -m fast At this point, with fast shutdown, the database is in an incomplete shutdown state:\n~/lzl/slot]$ ps -axjf|grep 110402 150696 64964 64961 146782 pts/42 64961 S+ 6001 0:00 \\_ grep --color=auto 110402 1 110402 110402 110402 ? -1 Ss 6001 0:00 /myhost/postgres/base/rasesql1.5.6/bin/postgres -D /myhost/pg8094/data 110402 110599 110599 110599 ? -1 Ss 6001 0:00 \\_ postgres: lzlpg: logger 110402 117803 117803 117803 ? 
-1 Ss 6001 0:00 \\_ postgres: lzlpg: checkpointer 110402 117807 117807 117807 ? -1 Ss 6001 0:00 \\_ postgres: lzlpg: stats collector 110402 118563 118563 118563 ? -1 Rs 6001 3:29 \\_ postgres: lzlpg: walsender lzl 127.0.0.1(62971) idle 110402 222918 222918 222918 ? -1 Rs 6001 2:59 \\_ postgres: lzlpg: walsender dtssync 30.181.46.203(57218) idle Walsender, checkpointer, postmaster are all still there; logger and stats haven\u0026rsquo;t exited either.\nThe control file state is in production: meaning running in production, indicating the local shutdown checkpoint by checkpointer didn\u0026rsquo;t complete:\n~/lzl/slot]$ pg_controldata|grep -i state Database cluster state: in production Checkpointer stack:\npstack 117803 #0 0x00002b879fe0b983 in __select_nocancel () from /lib64/libc.so.6 #1 0x00000000008fd04a in pg_usleep (microsec=microsec@entry=10000) at pgsleep.c:56 #2 0x00000000007610c8 in WalSndWaitStopping () at walsender.c:3209 #3 0x000000000051fa86 in ShutdownXLOG (code=code@entry=0, arg=arg@entry=0) at xlog.c:8596 #4 0x00000000007215ff in HandleCheckpointerInterrupts () at checkpointer.c:566 #5 CheckpointerMain () at checkpointer.c:343 ... 
At this point, checkpointer is stuck in WalSndWaitStopping, meaning checkpointer is waiting for walsender processes to enter stopping state.\nWalsender stack at this point:\n#0 0x00000000007484fb in ReorderBufferLargestTXN (rb=\u0026lt;optimized out\u0026gt;) at reorderbuffer.c:2345 #1 ReorderBufferCheckMemoryLimit (rb=0x2b8808b94118) at reorderbuffer.c:2390 #2 ReorderBufferQueueChange (rb=0x2b8808b94118, xid=\u0026lt;optimized out\u0026gt;, lsn=1676456602544, change=change@entry=0x2b87a229f408) at reorderbuffer.c:649 #3 0x000000000073ec99 in DecodeTruncate (buf=\u0026lt;optimized out\u0026gt;, buf=\u0026lt;optimized out\u0026gt;, ctx=\u0026lt;optimized out\u0026gt;) at decode.c:872 #4 DecodeHeapOp (buf=0x7ffda7d35180, ctx=0x2b87a224b118) at decode.c:455 #5 LogicalDecodingProcessRecord (ctx=0x2b87a224b118, record=\u0026lt;optimized out\u0026gt;) at decode.c:126 #6 0x000000000075f502 in XLogSendLogical () at walsender.c:2886 #7 0x0000000000761822 in WalSndLoop (send_data=send_data@entry=0x75f4c0 \u0026lt;XLogSendLogical\u0026gt;) at walsender.c:2287 ... Walsender is stuck in the transaction spill function. (Why it\u0026rsquo;s stuck is still unclear!!!)\nCheckpointer process is blocked in WalSndWaitStopping:\n/* * Wait that all the WAL senders have quit or reached the stopping state. This * is used by the checkpointer to control when the shutdown checkpoint can * safely be performed. 
*/ void WalSndWaitStopping(void) { for (;;) { int\ti; bool\tall_stopped = true; for (i = 0; i \u0026lt; max_wal_senders; i++) { WalSnd\t*walsnd = \u0026amp;WalSndCtl-\u0026gt;walsnds[i]; SpinLockAcquire(\u0026amp;walsnd-\u0026gt;mutex); if (walsnd-\u0026gt;pid == 0) { SpinLockRelease(\u0026amp;walsnd-\u0026gt;mutex); continue; } if (walsnd-\u0026gt;state != WALSNDSTATE_STOPPING) { all_stopped = false; SpinLockRelease(\u0026amp;walsnd-\u0026gt;mutex); break; } SpinLockRelease(\u0026amp;walsnd-\u0026gt;mutex); } /* safe to leave if confirmation is done for all WAL senders */ if (all_stopped) return; pg_usleep(10000L);\t/* wait for 10 msec */ } } From the code and stack, it\u0026rsquo;s clear the condition walsnd-\u0026gt;state != WALSNDSTATE_STOPPING is hit, causing the infinite loop.\nTest: Handling the Mid-Shutdown State # The above is an awkward mid-shutdown state. Besides kill -9, there are other better ways to achieve consistent shutdown:\nSolution 1: Shut down the downstream process Solution 2: Send SIGTERM to walsender Solution 1 test:\nWhen the downstream exits, walsender will also exit:\nstatic void ProcessRepliesIfAny(void) {... /* * \u0026#39;X\u0026#39; means that the standby is closing down the socket. 
*/ case \u0026#39;X\u0026#39;: proc_exit(0); For pub/sub, execute the following on the subscriber side; even if the upstream is in mid-shutdown state, this will cause walsender to exit:\n\\c lzldb alter SUBSCRIPTION sub_lzl disable; However, this depends on the downstream\u0026rsquo;s own handling; we can\u0026rsquo;t always quickly shut down the downstream receiver process of DTS and other sync tools.\nSolution 2 test:\nSince walsender registers the SIGTERM signal, and the select pg_terminate_backend($walsender_pid) command run while the database is running also sends SIGTERM to walsender, theoretically just sending SIGTERM to walsender should handle this, without needing kill -9.\nCommand:\nkill -SIGTERM 62834 #same as kill -15 62834 #same as kill 62834 After normal kill, pm and all other processes exit completely.\nCheck the control file and WAL log to confirm consistent shutdown:\npg_controldata database state changed from in production to shut down — consistent shutdown: $ pg_controldata|grep -i state Database cluster state: shut down The last record in the WAL log is CHECKPOINT_SHUTDOWN: pg_waldump 000000010000018600000012|tail -1 pg_waldump: fatal: error in WAL record at 186/915D7920: invalid record length at 186/915D7998: wanted 24, got 0 rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 186/915D7920, prev 186/915D78A8, desc: CHECKPOINT_SHUTDOWN redo 186/915D7920; tli 1; prev tli 1; fpw true; xid 0:13431045; oid 3808147; multi 3; offset 6; oldest xid 485 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 494/13431044; oldest running xid 0; shutdown Test: Reproducing Only Primary Having CHECKPOINT_SHUTDOWN # A phenomenon in the production environment was that the local WAL had a shutdown checkpoint but the standby didn\u0026rsquo;t. 
In production, an immediate stop was performed during mid-shutdown, and then startup failed.\nAt the time, the last 2 WAL records on primary and standby looked something like:\n#Primary WAL: CHECKPOINT_ONLINE CHECKPOINT_SHUTDOWN #Standby WAL: CHECKPOINT_ONLINE Reproduction commands:\n## 1. First reproduce walsender blocking shutdown (skipped) ## 2. Check the last WAL record rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 188/307ABE00, prev 188/307ABDC8, desc: RUNNING_XACTS nextXid 13432445 latestCompletedXid 13432444 oldestRunningXid 13432445 ## 3. pg_ctl stop -D $PGDATA -m i ## 4. Check last WAL record Unchanged, same as 2 ## 5. pg_ctl start -D $PGDATA ## 6. Check last two WAL records rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 188/307ABE00, prev 188/307ABDC8, desc: RUNNING_XACTS nextXid 13432445 latestCompletedXid 13432444 oldestRunningXid 13432445 #same as 2 rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 188/307ABE38, prev 188/307ABE00, desc: CHECKPOINT_SHUTDOWN redo 188/307ABE38; tli 1; prev tli 1; fpw true; xid 0:13432445; oid 3832732; multi 3; offset 6; oldest xid 485 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 494/13432444; oldest running xid 0; shutdown #CHECKPOINT_SHUTDOWN appears From this reproduction, CHECKPOINT_SHUTDOWN is actually done during startup!\nThis matches the production sequence: 1. fast shutdown didn\u0026rsquo;t complete 2. immediate shutdown 3. startup failed.\nQuestion 1: When during startup is CHECKPOINT_SHUTDOWN done?\nQuestion 2: When is CHECKPOINT_ONLINE triggered? From reproduction appearances, occasionally fast shutdown results in the last WAL record being CHECKPOINT_ONLINE.\nQuestion 1 analysis:\nDoing a shutdown checkpoint at startup easily suggests the startup process. Since we\u0026rsquo;ve previously analyzed the startup process flow, we can directly locate the function StartupXLOG:\n/* * This must be called ONCE during postmaster or standalone-backend startup */ void StartupXLOG(void) {... 
if (InRecovery) //After an unclean (immediate) stop, crash recovery is needed { /* * Perform a checkpoint to update all our recovery activity to disk. * * Note that we write a shutdown checkpoint rather than an on-line * one. This is not particularly critical, but since we may be * assigning a new TLI, using a shutdown checkpoint allows us to have * the rule that TLI only changes in shutdown checkpoints, which * allows some extra error checking in xlog_redo. * * In fast promotion, only create a lightweight end-of-recovery record * instead of a full checkpoint. A checkpoint is requested later, * after we\u0026#39;re fully out of recovery mode and already accepting * queries. */ if (bgwriterLaunched) //This branch is clearly for a standby in streaming replication {... } else //Primary startup goes here CreateCheckPoint(CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_IMMEDIATE); } Writing a shutdown checkpoint here is intentional, mainly to keep the TLI logic robust. Whenever the previous shutdown was not consistent, a shutdown checkpoint is performed during startup. So an -m i forced shutdown followed by a startup also produces CHECKPOINT_SHUTDOWN, as confirmed by my own test.\nQuestion 2 analysis:\nTested multiple times, occasionally seen. Speculation: the conditions for a checkpoint happened to be met just before shutdown, so an online checkpoint was triggered by pure coincidence.\nAfter a failed database shutdown, a script, the HA layer, or a human operator may well force the shutdown, so it\u0026rsquo;s recommended to run at least one checkpoint before shutting down.\nTest: Impact of Archiving on Shutdown # While analyzing the shutdown code, I also found that after the checkpointer process exits, the reaper branch that handles the checkpointer sends SIGUSR2 to pgarch so that it performs one last archive cycle and exits:\nstatic void reaper(SIGNAL_ARGS) {... if (pid == CheckpointerPID) { CheckpointerPID = 0; if (EXIT_STATUS_0(exitstatus) \u0026amp;\u0026amp; pmState == PM_SHUTDOWN) {... 
/* Waken archiver for the last time */ if (PgArchPID != 0) signal_child(PgArchPID, SIGUSR2); ... } ... And pm\u0026rsquo;s exit depends on all processes except syslogger having exited:\nif (pmState == PM_WAIT_DEAD_END) { if (dlist_is_empty(\u0026amp;BackendList) \u0026amp;\u0026amp; PgArchPID == 0 \u0026amp;\u0026amp; PgStatPID == 0) { /* These other guys should be dead already */ Assert(StartupPID == 0); Assert(WalReceiverPID == 0); Assert(BgWriterPID == 0); Assert(CheckpointerPID == 0); Assert(WalWriterPID == 0); Assert(AutoVacPID == 0); /* syslogger is not considered here */ pmState = PM_NO_CHILDREN; } } So in production, slow archiving was also found to affect shutdown.\nReproduction commands:\n#Configure archiving archive_mode = on archive_command = \u0026#39;/bin/false ;sleep 1000\u0026#39;#Set archiving to always fail with sleep to bypass NUM_ARCHIVE_RETRIES logic #Shutdown pg_ctl stop -D $PGDATA -m fast Processes after shutdown:\n$ ps -axjf|grep 61470 72200 88406 88405 68705 pts/48 88405 S+ 6001 0:00 \\_ grep --color=auto 61470 1 61470 61470 61470 ? -1 Ss 6001 0:00 /myhost/postgres/base/rasesql1.5.6/bin/postgres -D /myhost/pg8094/data 61470 61772 61772 61772 ? -1 Ss 6001 0:00 \\_ postgres: lzlpg: logger 61470 63880 63880 63880 ? -1 Ss 6001 0:00 \\_ postgres: lzlpg: archiver archiving 000000010000018800000007 Since the checkpointer here has already fully stopped, the database is in a consistent state, so using kill -9 on archiver is fine.\nOne-Sentence Summary # Q1: Why didn\u0026rsquo;t shutdown complete?\nWalsender blocked shutdown. Checkpointer sent SIGUSR1 to walsender and infinitely waited for all walsender processes to enter stopping state; checkpointer got stuck at this step.\nThe shutdown eventually completed due to -m i forced shutdown.\nQ2: Is there a graceful way to shut down from the mid-shutdown state caused by walsender blocking?\nYes. Send SIGTERM (i.e. kill, or kill -15, kill -SIGTERM) to all walsenders. 
Afterwards, checkpointer and postmaster complete a clean shutdown.\nWalsender installs a SIGTERM handler at startup, and in testing there was no scenario in which it failed to handle the signal.\nSIGTERM is also the signal sent by pg_terminate_backend(pid), and pg_terminate_backend is the command to run to stop a walsender as part of a standard shutdown.\nQ3: Why did primary and standby differ by exactly one shutdown checkpoint?\n3.1 Explanation for both primary and standby having CHECKPOINT_ONLINE:\nThe primary triggering CHECKPOINT_ONLINE was purely coincidental. Since the physical walsender was still alive, this WAL record was transmitted to the standby.\n3.2 Explanation for only primary having CHECKPOINT_SHUTDOWN:\nThis CHECKPOINT_SHUTDOWN was written during primary startup. Since the primary had not finished starting, this WAL record was not transmitted to the standby.\nQ4: Why does archiver block shutdown?\nWhen reaping the checkpointer process, pm tells archiver to do one last archive, and pm cannot exit until every process except syslogger has exited. So if the last archive is slow or hangs, it blocks shutdown. A failing archive command does not: the archiver process exits quickly on failure.\nQ5: Which processes can block shutdown?\nStrictly speaking, any process that fails to exit can block shutdown. The question is which ones are more likely to cause trouble. 
From the shutdown code flow, archiver and walsender commonly block shutdown because they perform a last archive or log transmission during the shutdown phase.\nReferences # https://www.postgresql.org/docs/current/server-shutdown.html https://wiki.postgresql.org/wiki/Signals postgres.c postmaster.c walsender.c xlog.c checkpointer.c startup.c pgarch.c\n","date":"Jan 4, 2025","externalUrl":null,"permalink":"/en/2025/01/04/pg-shutdown-logic-and-walsender-blocking-shutdown-analysis/","section":"Posts","summary":"Walsender Blocking Shutdown Symptoms # Production shutdown log output:\n2024-12-06 17:00:02.036 CST,,,447560,,65693cde.6d448,1320,,2023-12-01 09:54:38 CST,,0,LOG,00000,\"received fast shutdown request\",,,,,,,,,\"\",\"postmaster\" 2024-12-06 17:00:02.295 CST,,,447560,,65693cde.6d448,1322,,2023-12-01 09:54:38 CST,,0,LOG,00000,\"background worker \"\"logical replication launcher\"\" (PID 448996) exited with exit code 1\",,,,,,,,,\"\",\"postmaster\" 2024-12-06 17:00:10.627 CST,,,448990,,65693ce0.6d9de,213833,,2023-12-01 09:54:40 CST,,0,LOG,00000,\"checkpoint complete: wrote 426844 buffers (5.1%); 0 WAL file(s) added, 0 removed, 5 recycled; write=91.427 s, sync=0.055 s, total=91.508 s; sync files=761, longest=0.028 s, average=0.001 s; distance=2197531 kB, estimate=2680783 kB\",,,,,,,,,\"\",\"checkpointer\" 2024-12-06 17:00:10.628 CST,,,448990,,65693ce0.6d9de,213834,,2023-12-01 09:54:40 CST,,0,LOG,00000,\"shutting down\",,,,,,,,,\"\",\"checkpointer\" ... 
--checkpointer finished checkpoint and is in shutting down state, pm has not exited --160s later pm receives immediate shutdown, triggered by health check script 2024-12-06 17:02:43.348 CST,,,447560,,65693cde.6d448,1323,,2023-12-01 09:54:38 CST,,0,LOG,00000,\"received immediate shutdown request\",,,,,,,,,\"\",\"postmaster\" 2024-12-06 17:02:43.370 CST,\"logicaluser\",\"lzldb\",283840,\"10.33.77.159:39865\",6751a2dc.454c0,7,\"idle\",2024-12-05 20:55:56 CST,89/847309655,0,WARNING,57P02,\"terminating connection because of crash of another server process\",\"The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.\",\"In a moment you should be able to reconnect to the database and repeat your command.\",,,,,,,\"Debezium Streaming\",\"walsender\" 2024-12-06 17:02:43.370 CST,\"logicaluser\",\"lzldb\",157641,\"10.33.77.159:39407\",67408354.267c9,7,\"idle\",2024-11-22 21:12:52 CST,9/3193590104,0,WARNING,57P02,\"terminating connection because of crash of another server process\",\"The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.\",\"In a moment you should be able to reconnect to the database and repeat your command.\",,,,,,,\"Debezium Streaming\",\"walsender\" 2024-12-06 17:02:43.370 CST,\"logicaluser\",\"lzldb\",157916,\"10.33.77.159:57038\",67408356.268dc,7,\"idle\",2024-11-22 21:12:54 CST,115/3293293502,0,WARNING,57P02,\"terminating connection because of crash of another server process\",\"The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.\",\"In a moment you should be able to reconnect to the database and repeat your command.\",,,,,,,\"Debezium Streaming\",\"walsender\" 2024-12-06 
17:02:43.370 CST,\"repuser\",\"\",164392,\"30.151.40.19:41641\",66b25869.28228,3,\"streaming 42D3B/1732C5F0\",2024-08-07 01:07:53 CST,296/0,0,WARNING,57P02,\"terminating connection because of crash of another server process\",\"The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.\",\"In a moment you should be able to reconnect to the database and repeat your command.\",,,,,,,\"standby_6666\",\"walsender\" 2024-12-06 17:02:43.371 CST,,,447560,,65693cde.6d448,1324,,2023-12-01 09:54:38 CST,,0,LOG,00000,\"archiver process (PID 448994) exited with exit code 2\",,,,,,,,,\"\",\"postmaster\" 2024-12-06 17:02:43.371 CST,\"logicaluser\",\"lzldb\",57755,\"10.33.77.159:38918\",67125534.e19b,7,\"idle\",2024-10-18 20:31:48 CST,243/902018192,0,WARNING,57P02,\"terminating connection because of crash of another server process\",\"The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.\",\"In a moment you should be able to reconnect to the database and repeat your command.\",,,,,,,\"Debezium Streaming\",\"walsender\" 2024-12-06 17:02:43.372 CST,\"logicaluser\",\"lzldb\",157915,\"10.33.77.159:43433\",67408356.268db,7,\"idle\",2024-11-22 21:12:54 CST,60/3248014863,0,WARNING,57P02,\"terminating connection because of crash of another server process\",\"The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.\",\"In a moment you should be able to reconnect to the database and repeat your command.\",,,,,,,\"Debezium Streaming\",\"walsender\" --pm finished shutting down 2024-12-06 17:02:57.534 CST,,,447560,,65693cde.6d448,1325,,2023-12-01 09:54:38 CST,,0,LOG,00000,\"database system is shut 
down\",,,,,,,,,\"\",\"postmaster\" 2024-12-06 17:03:49.536 CST,,,211844,,6752bdf3.33b84,1,,2024-12-06 17:03:47 CST,,0,LOG,00000,\"ending log output to stderr\",,\"Future log output will go to log destination \"\"csvlog\"\".\",,,,,,,\"\",\"postmaster\" 17:00:02 postmaster receives fast shutdown\n","title":"PG Shutdown Logic and Walsender Blocking Shutdown Analysis","type":"posts"},{"content":" Problem Symptom — Slow Startup # Version: PG 13.2\nDatabase startup was slow. The startup process was reading spill files, and the filenames kept changing. Checking the spill files was also very slow — ls -l eventually showed 8 million spill files.\nWhy Tens of Millions of Spill Files? # WAL Segment and LSN Meaning # LSN # LSN is a 64-bit bigint. An LSN actually looks like 42D3B/1732C540 (hex). Before the slash / is the 32-bit logical log number, and after the / are 32 bits split into segment number + block number + intra-block offset. These 4 parts are:\n32 bits 8 bits 11 bits 13 bits Logical log number Log segment number Block number Intra-block offset Intra-block offset 8192 = 2^13\nBlock number = 16M (default WAL segment size) / 8192\nWAL Segment # A WAL filename consists of 3 groups of hex digits.\nTaking the 8k WAL file 0000000300042D3B00000002 as example:\n32 bits 32 bits 32 bits timeline Logical log number Log segment number 00000003 00042D3B 00000002 It can be seen that an LSN can locate a WAL filename and the offset position within the file.\nAmong these, the part before the LSN slash / is the logical log number, and the 8-bit log segment number after the slash / will be used below.\nSpill Filename Conversion # Replication slot name: logical_ex2209_rep\nSpill filename: xid-407989064-lsn-42D1E-20000000.spill\n42D1E is not a complete LSN and cannot be directly used with pg_walfile_name to locate a WAL filename. 42D1E is a logical log number. 
If we directly filter WAL files containing 42D1E in the name, we find 16 WAL files.\nCan we locate the WAL log segment number from the number 20000000 to pinpoint the exact file?\nSpill filename generation:\n/* * Given a replication slot, transaction ID and segment number, fill in the * corresponding spill file into \u0026#39;path\u0026#39;, which is a caller-owned buffer of size * at least MAXPGPATH. */ static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot, TransactionId xid, XLogSegNo segno) { XLogRecPtr recptr; XLogSegNoOffsetToRecPtr(segno, 0, wal_segment_size, recptr); snprintf(path, MAXPGPATH, \u0026#34;pg_replslot/%s/xid-%u-lsn-%X-%X.spill\u0026#34;, NameStr(MyReplicationSlot-\u0026gt;data.name), xid, (uint32) (recptr \u0026gt;\u0026gt; 32), (uint32) recptr); } The pg_replslot/%s and xid-%u-lsn parts are easy to understand: they are just the replication slot name and the xid. The recptr needs a closer look at its definition:\n/* * Pointer to a location in the XLOG. These pointers are 64 bits wide, * because we don\u0026#39;t want them ever to overflow. */ typedef uint64 XLogRecPtr; XLogSegNoOffsetToRecPtr calculates the LSN from the WAL log segment number, segment size, and offset:\n#define XLogSegNoOffsetToRecPtr(segno, offset, wal_segsz_bytes, dest) \\ (dest) = (segno) * (wal_segsz_bytes) + (offset) XLogRecPtr is the LSN! So:\n(uint32) (recptr \u0026gt;\u0026gt; 32) takes the high 32 bits of the LSN, and (uint32) recptr takes the low 32 bits.\nThe high 32 bits of the LSN are what we saw as the first half of the LSN, lsn-42D1E. The low 32 bits of the LSN actually contain more information; here we only need their leading bits, which hold the segment number.\nSince the passed-in offset=0 and we also have segno, we don\u0026rsquo;t actually need the intra-block offset information to calculate the dest value. In this instance wal_segsz_bytes is 128 MB = 128*1024*1024. 
Converting the formula in XLogSegNoOffsetToRecPtr:\nsegno= dest/(128*1024*1024) -- Convert hex 20000000 segno= x\u0026#39;20000000\u0026#39;::int/(128*1024*1024) segno= 4 From this formula we can derive the log segment number segno, which lets us locate the WAL file number.\nSo the spill filename xid-407989064-lsn-42D1E-20000000.spill corresponds to the WAL file:\nLogical log number=42D1E, segment number=04:\nls 42D1E*04 0000000200042D1E00000004 pg_waldump shows xid 407989064 inside.\nIn practice, the WAL size is also fixed after instance creation, i.e. (128*1024*1024) is a constant, so segno is absolutely correlated with (uint32) recptr, but not equal to it. This means that switching to a new WAL log file creates a new spill file.\nSummary of spill file generation rules:\nSame transaction id: if it spans multiple WAL files, it produces multiple spills. E.g., a large transaction without subtransactions spanning 3 WAL files produces 3 spill files. Different transaction ids produce different spills. E.g., 10 million subtransactions produce 10 million spill files. Spill filename structure xid-407989064-lsn-42D1E-20000000.spill:\nxid First 32 bits of LSN; i.e., WAL logical log number Converted from WAL log segment number; not equal to segment number xid-407989064 lsn-42D1E 20000000 ## Recovered environment [postgres]$ ll |head -100 total 40000276 -rw------- 1 postgres postgres 184 Dec 6 15:20 state -rw------- 1 postgres postgres 196 Dec 6 13:25 xid-407989064-lsn-42D1E-0.spill -rw------- 1 postgres postgres 208 Dec 6 13:25 xid-407989064-lsn-42D1E-20000000.spill ... 
-rw------- 1 postgres postgres 540 Dec 6 16:44 xid-407989064-lsn-42D2A-D0000000.spill -rw------- 1 postgres postgres 152 Dec 6 13:09 xid-407989065-lsn-42D1D-C8000000.spill -rw------- 1 postgres postgres 152 Dec 6 13:09 xid-407989066-lsn-42D1D-C8000000.spill -rw------- 1 postgres postgres 152 Dec 6 13:09 xid-407989068-lsn-42D1D-C8000000.spill -rw------- 1 postgres postgres 152 Dec 6 13:09 xid-407989070-lsn-42D1D-C8000000.spill -rw------- 1 postgres postgres 152 Dec 6 13:09 xid-407989072-lsn-42D1D-C8000000.spill -rw------- 1 postgres postgres 152 Dec 6 13:09 xid-407989076-lsn-42D1D-C8000000.spill -rw------- 1 postgres postgres 152 Dec 6 13:09 xid-407989079-lsn-42D1D-C8000000.spill -rw------- 1 postgres postgres 152 Dec 6 13:09 xid-407989080-lsn-42D1D-C8000000.spill -rw------- 1 postgres postgres 152 Dec 6 13:09 xid-407989082-lsn-42D1D-C8000000.spill [postgres@lzlhost /myhost/liuzhilong/pg_replslot/logical_ex9e15_rep]$ ll |awk \u0026#39;{print $9}\u0026#39;|awk -F \u0026#39;-\u0026#39; \u0026#39;{print $2}\u0026#39;|sort|uniq -c|wc -l 10000003 [postgres@lzlhost /myhost/liuzhilong/pg_replslot/logical_ex9e15_rep]$ ll |wc -l 10000070 So in the actual environment we saw 10,000,070 files, with 10,000,003 distinct xids among them — meaning 1 parent transaction spanning about 70 WAL files, with this parent transaction having 10 million subtransactions.\nReplication Slot Spill Testing # --Pub/sub replication link setup logical_decoding_work_mem = 64MB #pg_ctl reload wal_segment_size =128 MB --source CREATE TABLE replication_table ( id BIGSERIAL PRIMARY KEY, column1 char(2000), column2 char(2000), column3 char(2000) ); create publication pub_test for table replication_table ; --dest CREATE TABLE replication_table ( id BIGSERIAL PRIMARY KEY, column1 char(2000), column2 char(2000), column3 char(2000) ); CREATE SUBSCRIPTION sub_test CONNECTION \u0026#39;host=127.0.0.1 port=8094 dbname=lzl user=lzl password=qwer\u0026#39; PUBLICATION pub_test; --source select * from 
pg_replication_slots; Large Transaction, No Subtransactions, Replicated Table Spill Test # --Create a large transaction, don\u0026#39;t commit yet begin; insert into replication_table(column1,column2,column3) select \u0026#39;a\u0026#39;,\u0026#39;b\u0026#39;,\u0026#39;c\u0026#39; from generate_series(1,1000000) g; --Replication slot spill ll total 331924 -rw------- 1 postgres postgres 184 Dec 9 20:22 state -rw------- 1 postgres postgres 88226964 Dec 9 20:22 xid-5074343-lsn-163-38000000.spill -rw------- 1 postgres postgres 119698488 Dec 9 20:22 xid-5074343-lsn-163-40000000.spill After the large transaction commits, wait for consumption until replication lag is 0, and the spill files disappear:\nM=# select pid,usename,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,reply_time from pg_stat_replication; pid | usename | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | reply_time --------+---------+--------------+--------------+--------------+--------------+-----------+-----------+------------+------------------------------ 163525 | lzl | 163/4996E1C8 | 163/4996E1C8 | 163/4996E1C8 | 163/4996E1C8 | [null] | [null] | [null] | 2024-12-09 20:25:35.14769+08 (1 row) M=# select pid,usename,pg_wal_lsn_diff(pg_current_wal_lsn(),sent_lsn) diff_sent_mb,pg_wal_lsn_diff(pg_current_wal_lsn(),write_lsn) diff_write_mb,pg_wal_lsn_diff(pg_current_wal_lsn(),flush_lsn) diff_flush_mb,pg_wal_lsn_diff(pg_current_wal_lsn(),replay_lsn) diff_replay_mb,pg_walfile_name_offset(sent_lsn) sentoffset,pg_walfile_name_offset(write_lsn) writeoffset,pg_walfile_name_offset(flush_lsn) flush_lsn from pg_stat_replication; pid | usename | diff_sent_mb | diff_write_mb | diff_flush_mb | diff_replay_mb | sentoffset | writeoffset | flush_lsn --------+---------+--------------+---------------+---------------+----------------+-------------------------------------+-------------------------------------+------------------------------- 163525 | lzl | 0 | 0 | 0 | 
0 | (000000010000016300000009,26665416) | (000000010000016300000009,26665416) | (000000 [/mypg/pg8094/data/pg_replslot/sub_test]$ ll total 357392 -rw------- 1 postgres postgres 184 Dec 9 20:23 state -rw------- 1 postgres postgres 88226964 Dec 9 20:22 xid-5074343-lsn-163-38000000.spill -rw------- 1 postgres postgres 137696328 Dec 9 20:23 xid-5074343-lsn-163-40000000.spill -rw------- 1 postgres postgres 26076708 Dec 9 20:23 xid-5074343-lsn-163-48000000.spill [/mypg/pg8094/data/pg_replslot/sub_test]$ ll total 4 -rw------- 1 postgres postgres 184 Dec 9 20:25 state2666 (1 row) Large Transaction, No Subtransactions, Non-Replicated Table Spill Test # --source: create an unrelated table for writing data CREATE TABLE no_replication_table ( id BIGSERIAL PRIMARY KEY, column1 char(2000), column2 char(2000), column3 char(2000) ); --Create a large transaction, don\u0026#39;t commit yet begin; insert into no_replication_table(column1,column2,column3) select \u0026#39;a\u0026#39;,\u0026#39;b\u0026#39;,\u0026#39;c\u0026#39; from generate_series(1,1000000) g; --Spill [postgres@lzldb:MYINST:8094 /mypg/pg8094/data/pg_replslot/sub_test]$ ll total 357492 -rw------- 1 postgres postgres 184 Dec 9 20:09 state -rw------- 1 postgres postgres 107511456 Dec 9 20:08 xid-5074106-lsn-163-28000000.spill -rw------- 1 postgres postgres 137698804 Dec 9 20:09 xid-5074106-lsn-163-30000000.spill -rw------- 1 postgres postgres 4308444 Dec 9 20:09 xid-5074106-lsn-163-38000000.spill Large Transaction, Subtransactions, Non-Replicated Table Spill Test # ## One insert per row, each insert as one subtransaction echo \u0026#34;begin;\u0026#34;\u0026gt;subtx.sql for i in {1..1000000} do echo \u0026#34;savepoint p$i;\u0026#34;\u0026gt;\u0026gt;subtx.sql echo \u0026#34;insert into no_replication_table(column1,column2,column3) select \u0026#39;a\u0026#39;,\u0026#39;b\u0026#39;,\u0026#39;c\u0026#39;;\u0026#34;\u0026gt;\u0026gt;subtx.sql done nohup psql -d lzl -f subtx.sql \u0026amp; #During execution, observed 800k+ 
spill files [/myhost/pg8094/data/pg_replslot/sub_test]$ ll |wc -l 823749 [/myhost/pg8094/data/pg_replslot/sub_test]$ ll |head -10 total 1099532 -rw------- 1 postgres postgres 184 Dec 9 21:10 state -rw------- 1 postgres postgres 1236 Dec 9 21:10 xid-5519686-lsn-163-70000000.spill -rw------- 1 postgres postgres 252 Dec 9 21:09 xid-5519687-lsn-163-70000000.spill -rw------- 1 postgres postgres 252 Dec 9 21:09 xid-5519688-lsn-163-70000000.spill -rw------- 1 postgres postgres 252 Dec 9 21:09 xid-5519689-lsn-163-70000000.spill -rw------- 1 postgres postgres 252 Dec 9 21:09 xid-5519690-lsn-163-70000000.spill -rw------- 1 postgres postgres 252 Dec 9 21:09 xid-5519691-lsn-163-70000000.spill -rw------- 1 postgres postgres 252 Dec 9 21:09 xid-5519692-lsn-163-70000000.spill -rw------- 1 postgres postgres 252 Dec 9 21:09 xid-5519693-lsn-163-70000000.spill Analysis of Slow Database Startup # Startup Process Startup Flow Analysis # Here we parse the startup flow frame by frame using the call stack:\n11: main: Nothing to say.\n10: PostmasterMain:\nBefore the main loop, it first calls the startup flow StartupPID = StartupDataBase(); which essentially calls StartChildProcess(StartupProcess):\n#define StartupDataBase()\tStartChildProcess(StartupProcess) 9: StartChildProcess: Forks a process. This process is the auxiliary process for starting postmaster; normal child process startup goes through this logic, forking at this step. The input AuxProcType=StartupProcess.\n8: AuxiliaryProcessMain:\nSince MyAuxProcType=StartupProcess, it goes through the StartupProcessMain flow, which is different from child processes like walsender, walwriter, bgwriter. 
The startup process itself exists for crash recovery WAL reading, but it does many other things:\nswitch (MyAuxProcType) { case CheckerProcess: /* don\u0026#39;t set signals, they\u0026#39;re useless here */ CheckerModeMain(); proc_exit(1);\t/* should never return */ case BootstrapProcess: /* * There was a brief instant during which mode was Normal; this is * okay. We need to be in bootstrap mode during BootStrapXLOG for * the sake of multixact initialization. */ SetProcessingMode(BootstrapProcessing); bootstrap_signals(); BootStrapXLOG(); BootstrapModeMain(); proc_exit(1);\t/* should never return */ case StartupProcess: //Here here here here /* don\u0026#39;t set signals, startup process has its own agenda */ StartupProcessMain(); proc_exit(1);\t/* should never return */ case BgWriterProcess: /* don\u0026#39;t set signals, bgwriter has its own agenda */ BackgroundWriterMain(); proc_exit(1);\t/* should never return */ case CheckpointerProcess: /* don\u0026#39;t set signals, checkpointer has its own agenda */ CheckpointerMain(); proc_exit(1);\t/* should never return */ case WalWriterProcess: /* don\u0026#39;t set signals, walwriter has its own agenda */ InitXLOGAccess(); WalWriterMain(); proc_exit(1);\t/* should never return */ case WalReceiverProcess: /* don\u0026#39;t set signals, walreceiver has its own agenda */ WalReceiverMain(); proc_exit(1);\t/* should never return */ default: elog(PANIC, \u0026#34;unrecognized process type: %d\u0026#34;, (int) MyAuxProcType); proc_exit(1); } 7: StartupProcessMain: Mainly to call StartupXLOG().\n6: StartupXLOG:\nFunction comment:\nThis must be called ONCE during postmaster or standalone-backend startup StartupXLOG is always called by postmaster regardless, whether crash shutdown or consistent shutdown:\nswitch (ControlFile-\u0026gt;state) { ... 
case DB_IN_PRODUCTION: ereport(LOG, (errmsg(\u0026#34;database system was interrupted; last known up at %s\u0026#34;, str_time(ControlFile-\u0026gt;time)))); break; This matches the log output. Here\u0026rsquo;s the shutdown and startup log:\n2024-12-06 17:02:57.534 CST,,,447560,,65693cde.6d448,1325,,2023-12-01 09:54:38 CST,,0,LOG,00000,\u0026#34;database system is shut down\u0026#34;,,,,,,,,,\u0026#34;\u0026#34;,\u0026#34;postmaster\u0026#34; 2024-12-06 17:03:49.536 CST,,,211844,,6752bdf3.33b84,1,,2024-12-06 17:03:47 CST,,0,LOG,00000,\u0026#34;ending log output to stderr\u0026#34;,,\u0026#34;Future log output will go to log destination \u0026#34;\u0026#34;csvlog\u0026#34;\u0026#34;.\u0026#34;,,,,,,,\u0026#34;\u0026#34;,\u0026#34;postmaster\u0026#34; 2024-12-06 17:03:49.536 CST,,,211844,,6752bdf3.33b84,2,,2024-12-06 17:03:47 CST,,0,LOG,00000,\u0026#34;starting PostgreSQL 13.2 (RaseSQL 1.3) on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39.0.1), 64-bit\u0026#34;,,,,,,,,,\u0026#34;\u0026#34;,\u0026#34;postmaster\u0026#34; 2024-12-06 17:03:49.537 CST,,,211844,,6752bdf3.33b84,3,,2024-12-06 17:03:47 CST,,0,LOG,00000,\u0026#34;listening on IPv4 address \u0026#34;\u0026#34;0.0.0.0\u0026#34;\u0026#34;, port 7284\u0026#34;,,,,,,,,,\u0026#34;\u0026#34;,\u0026#34;postmaster\u0026#34; 2024-12-06 17:03:49.539 CST,,,211844,,6752bdf3.33b84,4,,2024-12-06 17:03:47 CST,,0,LOG,00000,\u0026#34;listening on Unix socket \u0026#34;\u0026#34;/tmp/.s.PGSQL.7284\u0026#34;\u0026#34;\u0026#34;,,,,,,,,,\u0026#34;\u0026#34;,\u0026#34;postmaster\u0026#34; 2024-12-06 17:03:49.557 CST,,,211995,,6752bdf5.33c1b,1,,2024-12-06 17:03:49 CST,,0,LOG,00000,\u0026#34;database system was interrupted; last known up at 2024-12-06 17:00:10 CST\u0026#34;,,,,,,,,,\u0026#34;\u0026#34;,\u0026#34;startup\u0026#34; So, after shutdown, the control file recorded the database state as in production:\nDatabase cluster state: in production The in production state means the database is 
running, not a normal shutdown state — indicating that at the time of shutdown, it was not a consistent shutdown.\nContinuing with the key code about fsync:\n/*---------- * If we previously crashed, perform a couple of actions: * * - The pg_wal directory may still include some temporary WAL segments * used when creating a new segment, so perform some clean up to not * bloat this path. This is done first as there is no point to sync * this temporary data. * * - There might be data which we had written, intending to fsync it, but * which we had not actually fsync\u0026#39;d yet. Therefore, a power failure in * the near future might cause earlier unflushed writes to be lost, even * though more recent data written to disk from here on would be * persisted. To avoid that, fsync the entire data directory. */ if (ControlFile-\u0026gt;state != DB_SHUTDOWNED \u0026amp;\u0026amp; ControlFile-\u0026gt;state != DB_SHUTDOWNED_IN_RECOVERY) { RemoveTempXlogFiles(); SyncDataDirectory(); } Here, because the control file state is not a normal shutdown, it enters the if-block and calls SyncDataDirectory() for fsync persistence.\nStartupXLOG does many many things. Among those related to spill, besides SyncDataDirectory(), there\u0026rsquo;s also StartupReorderBuffer():\n/* * Initialize replication slots, before there\u0026#39;s a chance to remove * required resources. */ StartupReplicationSlots(); /* * Startup logical state, needs to be setup now so we have proper data * during crash recovery. */ StartupReorderBuffer(); StartupReorderBuffer is also called. It calls ReorderBufferCleanupSerializedTXNs to clean up spill files in all slot directories (but does not delete directories or state files):\n/* * Delete all data spilled to disk after we\u0026#39;ve restarted/crashed. It will be * recreated when the respective slots are reused. 
*/ void StartupReorderBuffer(void) { DIR\t*logical_dir; struct dirent *logical_de; logical_dir = AllocateDir(\u0026#34;pg_replslot\u0026#34;); while ((logical_de = ReadDir(logical_dir, \u0026#34;pg_replslot\u0026#34;)) != NULL) { if (strcmp(logical_de-\u0026gt;d_name, \u0026#34;.\u0026#34;) == 0 || strcmp(logical_de-\u0026gt;d_name, \u0026#34;..\u0026#34;) == 0) continue; /* if it cannot be a slot, skip the directory */ if (!ReplicationSlotValidateName(logical_de-\u0026gt;d_name, DEBUG2)) continue; /* * ok, has to be a surviving logical slot, iterate and delete * everything starting with xid-* */ ReorderBufferCleanupSerializedTXNs(logical_de-\u0026gt;d_name); } FreeDir(logical_dir); } 5: SyncDataDirectory:\nThe function comment is very important:\n/* * Issue fsync recursively on PGDATA and all its contents. * * We fsync regular files and directories wherever they are, but we * follow symlinks only for pg_wal and immediately under pg_tblspc. * Other symlinks are presumed to point at files we\u0026#39;re not responsible * for fsyncing, and might not have privileges to write at all. * * Errors are logged but not considered fatal; that\u0026#39;s because this is used * only during database startup, to deal with the possibility that there are * issued-but-unsynced writes pending against the data directory. We want to * ensure that such writes reach disk before anything that\u0026#39;s done in the new * run. However, aborting on error would result in failure to start for * harmless cases such as read-only files in the data directory, and that\u0026#39;s * not good either. * * Note that if we previously crashed due to a PANIC on fsync(), we\u0026#39;ll be * rewriting all changes again during recovery. * * Note we assume we\u0026#39;re chdir\u0026#39;d into PGDATA to begin with. 
*/ fsync all data directory files to persist them. This action only happens during the startup phase. This action ensures the data directory is fully persistent before the database starts running. The body of SyncDataDirectory recursively walks directories and fsyncs (with some special handling for symlinks):\nwalkdir(\u0026#34;.\u0026#34;, datadir_fsync_fname, false, LOG); if (xlog_is_symlink) walkdir(\u0026#34;pg_wal\u0026#34;, datadir_fsync_fname, false, LOG); walkdir(\u0026#34;pg_tblspc\u0026#34;, datadir_fsync_fname, true, LOG); 4: walkdir: Recurse to .\n3: walkdir: Recurse to ./pg_replslot\n2: walkdir: Recurse to ./pg_replslot/slotname\n1: lstat: C library call. Besides calling fsync through the callback datadir_fsync_fname, the walkdir function body calls lstat on every directory entry to learn its type, so that regular files are handed to the callback and subdirectories are recursed into. The call returns the same file information as the Linux stat command (inode, file size, last modification time, and so on).\n0: _lxstat: C library call.\nStartup logic summary:\nPG launches an auxiliary process named startup to bring the database up. Unlike common child processes (walwriter, bgwriter, checkpointer, etc.), it\u0026rsquo;s always launched during startup and does many things. StartupXLOG is always called during startup, whether or not the database was consistently shut down. Only in a non-normal shutdown state does SyncDataDirectory get triggered. SyncDataDirectory fsyncs all data files for persistence and stats every entry it visits. fsync ensures earlier writes have reached disk before startup continues; lstat is how walkdir tells regular files from directories, so both calls are issued for each of the spill files. (Before the startup process starts, only the readability of the datadir directory itself has been verified.) Regardless of shutdown state, StartupReorderBuffer is always called and cleans up spill files for all replication slots. When Is the Ready State? # After the startup process finishes its work, the database is not yet in ready state. When the startup process exits, the postmaster\u0026rsquo;s child-reaping function, reaper, is called. 
The reaper function itself does some recovery or startup work after a child process exits. The pmState state machine records the state as PM_STARTUP, which controls the startup/shutdown state.\nLast steps of PostmasterMain:\nStartupPID = StartupDataBase(); Assert(StartupPID != 0); StartupStatus = STARTUP_RUNNING; pmState = PM_STARTUP; //State machine changes state /* Some workers may be scheduled to start now */ maybe_start_bgworkers(); status = ServerLoop(); /* * ServerLoop probably shouldn\u0026#39;t ever return, but if it does, close down. */ ExitPostmaster(status != STATUS_OK); abort();\t/* not reached */ } The core startup flow of PostmasterMain goes to reaper to handle the normal exit of the startup process.\nPMState comment:\n/* * We use a simple state machine to control startup, shutdown, and * crash recovery (which is rather like shutdown followed by startup). * * After doing all the postmaster initialization work, we enter PM_STARTUP * state and the startup process is launched. The startup process begins by * reading the control file and other preliminary initialization steps. * In a normal startup, or after crash recovery, the startup process exits * with exit code 0 and we switch to PM_RUN state. PMState is passed and processed via signals. After the startup process exits, reaper is activated to reap the process.\nreaper function handling the startup child process\u0026rsquo;s normal exit:\nif (pid == StartupPID) { StartupPID = 0; ... /* * Startup succeeded, commence normal operations */ StartupStatus = STARTUP_NOT_RUNNING; //Transition from STARTUP_RUNNING to STARTUP_NOT_RUNNING FatalError = false; //After none of the above ifs are hit, it\u0026#39;s not fatal AbortStartTime = 0; ReachedNormalRunning = true; pmState = PM_RUN; //State machine transitions from PM_STARTUP to PM_RUN connsAllowed = ALLOW_ALL_CONNS; /* * Crank up the background tasks, if we didn\u0026#39;t do that already * when we entered consistent recovery state. 
It doesn\u0026#39;t matter * if this fails, we\u0026#39;ll just try again later. */ //Below: starting core child processes if (CheckpointerPID == 0) CheckpointerPID = StartCheckpointer(); if (BgWriterPID == 0) BgWriterPID = StartBackgroundWriter(); if (WalWriterPID == 0) WalWriterPID = StartWalWriter(); /* * Likewise, start other special children as needed. In a restart * situation, some of them may be alive already. */ //Below: starting non-core child processes if (!IsBinaryUpgrade \u0026amp;\u0026amp; AutoVacuumingActive() \u0026amp;\u0026amp; AutoVacPID == 0) AutoVacPID = StartAutoVacLauncher(); if (PgArchStartupAllowed() \u0026amp;\u0026amp; PgArchPID == 0) PgArchPID = pgarch_start(); if (PgStatPID == 0) PgStatPID = pgstat_start(); /* workers may be scheduled to start now */ maybe_start_bgworkers(); //At this point it\u0026#39;s officially ready to accept connections /* at this point we are really open for business */ ereport(LOG, (errmsg(\u0026#34;database system is ready to accept connections\u0026#34;))); /* Report status */ AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_READY); #ifdef USE_SYSTEMD sd_notify(0, \u0026#34;READY=1\u0026#34;); #endif continue; } The \u0026ldquo;database system is ready to accept connections\u0026rdquo; message is right here.\nCheckpointer, bgwriter, walwriter, autovacuum, arch (if present), stats — all these processes need to be started. At this stage, launching these processes doesn\u0026rsquo;t have to return success; they can be retried later in ServerLoop or on the next reaper execution. 
Only the startup process must start and complete all related tasks in one shot:\nif (pid \u0026lt; 0) { /* in parent, fork failed */ int\tsave_errno = errno; errno = save_errno; switch (type) { case StartupProcess: ereport(LOG, (errmsg(\u0026#34;could not fork startup process: %m\u0026#34;))); break; case BgWriterProcess: ereport(LOG, (errmsg(\u0026#34;could not fork background writer process: %m\u0026#34;))); break; case CheckpointerProcess: ereport(LOG, (errmsg(\u0026#34;could not fork checkpointer process: %m\u0026#34;))); break; case WalWriterProcess: ereport(LOG, (errmsg(\u0026#34;could not fork WAL writer process: %m\u0026#34;))); break; case WalReceiverProcess: ereport(LOG, (errmsg(\u0026#34;could not fork WAL receiver process: %m\u0026#34;))); break; default: ereport(LOG, (errmsg(\u0026#34;could not fork process: %m\u0026#34;))); break; } /* * fork failure is fatal during startup, but there\u0026#39;s no need to choke * immediately if starting other child types fails. */ if (type == StartupProcess) ExitPostmaster(1); return 0; } Spill File Generation Logic Across Versions # Spill in all versions spills the largest transaction. Here we focus on when spilling happens.\nPG12: pg12 hard-codes 4096 changes:\nstatic const Size max_changes_in_memory = 4096; /* * Check whether the transaction tx should spill its data to disk. */ static void ReorderBufferCheckSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) { /* * TODO: improve accounting so we cheaply can take subtransactions into * account here. */ if (txn-\u0026gt;nentries_mem \u0026gt;= max_changes_in_memory) { ReorderBufferSerializeTXN(rb, txn); Assert(txn-\u0026gt;nentries_mem == 0); } } PG13: Spills when exceeding logical_decoding_work_mem memory size:\nstatic void ReorderBufferCheckMemoryLimit(ReorderBuffer *rb) { ... while (rb-\u0026gt;size \u0026gt;= logical_decoding_work_mem * 1024L) { /* * Pick the largest transaction (or subtransaction) and evict it from * memory by serializing it to disk. 
*/ txn = ReorderBufferLargestTXN(rb); ReorderBufferSerializeTXN(rb, txn); ... } PG14: Adds streaming of in-progress transactions via ReorderBufferStreamTXN:\nstatic void ReorderBufferCheckMemoryLimit(ReorderBuffer *rb) { ... while (rb-\u0026gt;size \u0026gt;= logical_decoding_work_mem * 1024L) { /* * Pick the largest transaction (or subtransaction) and evict it from * memory by streaming, if possible. Otherwise, spill to disk. */ if (ReorderBufferCanStartStreaming(rb) \u0026amp;\u0026amp; (txn = ReorderBufferLargestTopTXN(rb)) != NULL) {... ReorderBufferStreamTXN(rb, txn); } else {... ReorderBufferSerializeTXN(rb, txn); } ... } Although PG14 can stream in-progress transactions instead of spilling them, triggering streaming requires certain conditions:\n/* Returns true, if the streaming can be started now, false, otherwise. */ static inline bool ReorderBufferCanStartStreaming(ReorderBuffer *rb) { LogicalDecodingContext *ctx = rb-\u0026gt;private_data; SnapBuild *builder = ctx-\u0026gt;snapshot_builder; /* We can\u0026#39;t start streaming unless a consistent state is reached. */ if (SnapBuildCurrentState(builder) \u0026lt; SNAPBUILD_CONSISTENT) return false; /* * We can\u0026#39;t start streaming immediately even if the streaming is enabled * because we previously decoded this transaction and now just are * restarting. */ if (ReorderBufferCanStream(rb) \u0026amp;\u0026amp; !SnapBuildXactNeedsSkip(builder, ctx-\u0026gt;reader-\u0026gt;EndRecPtr)) return true; return false; } /* * Found a point after SNAPBUILD_FULL_SNAPSHOT where all transactions that * were running at that point finished. Till we reach that we hold off * calling any commit callbacks. */ SNAPBUILD_CONSISTENT = 2 Additional streaming trigger conditions:\nCondition 1: The snapshot builder has reached SNAPBUILD_CONSISTENT, i.e., all transactions that were running at the full-snapshot point have finished (presumably committed or rolled back). Condition 2: ReorderBufferCanStream(rb) returns true. Here rb-\u0026gt;private_data is only how the reorder buffer reaches its LogicalDecodingContext; judging by the comment (\u0026ldquo;even if the streaming is enabled\u0026rdquo;), the check is presumably whether streaming is enabled for this decoding context. 
Condition 3: SnapBuildXactNeedsSkip returns false for the current read position, i.e., per the comment above, the transaction is not one that was already decoded before a restart. PG15: Similar to 14, just cleaner functions with less nesting.\nPG16: About the same.\nPG17: About the same, adds DEBUG_LOGICAL_REP_STREAMING_IMMEDIATE to force streaming.\nKey points to remember:\nPG12 and earlier: hard-coded 4096 changes. PG13: adds the logical_decoding_work_mem parameter, allowing memory size adjustment to reduce spill probability. PG14 and later: supports streaming of in-progress transactions. Triggering streaming also requires certain conditions, so even with streaming, spills can still happen. PG17: adds the debug_logical_replication_streaming parameter to force streaming. Spill File Cleanup Logic # Startup-time spill cleanup is just one scenario. There\u0026rsquo;s also walsender startup cleanup and drop slot cleanup.\nWalsender Startup Cleanup # ReorderBufferCleanupSerializedTXNs is called during database startup (before walsender has started) and during walsender startup (while the database is running). Note these are different scenarios, though they call the same function. From the function comment, it\u0026rsquo;s meant to \u0026ldquo;remove leftover serialized reorder buffers\u0026rdquo;, i.e., clean up spill files.\n/* * Remove any leftover serialized reorder buffers from a slot directory after a * prior crash or decoding session exit. 
*/ static void ReorderBufferCleanupSerializedTXNs(const char *slotname) { DIR\t*spill_dir; struct dirent *spill_de; struct stat statbuf; char\tpath[MAXPGPATH * 2 + 12]; sprintf(path, \u0026#34;pg_replslot/%s\u0026#34;, slotname); /* we\u0026#39;re only handling directories here, skip if it\u0026#39;s not ours */ if (lstat(path, \u0026amp;statbuf) == 0 \u0026amp;\u0026amp; !S_ISDIR(statbuf.st_mode)) return; spill_dir = AllocateDir(path); while ((spill_de = ReadDirExtended(spill_dir, path, INFO)) != NULL) { /* only look at names that can be ours */ //Only compare first 3 characters if (strncmp(spill_de-\u0026gt;d_name, \u0026#34;xid\u0026#34;, 3) == 0) { snprintf(path, sizeof(path), \u0026#34;pg_replslot/%s/%s\u0026#34;, slotname, spill_de-\u0026gt;d_name); if (unlink(path) != 0) ereport(ERROR, (errcode_for_file_access(), errmsg(\u0026#34;could not remove file \\\u0026#34;%s\\\u0026#34; during removal of pg_replslot/%s/xid*: %m\u0026#34;, path, slotname))); } } FreeDir(spill_dir); } Two things to note about the above cleanup logic:\nCleans files whose names start with \u0026ldquo;xid\u0026rdquo;. Obviously, the state file is not cleaned. Uses unlink to clean, one file at a time. This can help us devise a faster startup scheme. Database Startup Cleanup # During database startup, a startup process is forked to clean slots. The cleanup function is the same one walsender calls: ReorderBufferCleanupSerializedTXNs.\nOne more difference: after walsender restarts, it only cleans spills for the current slot with the same name; whereas during database startup, all slot spills are cleaned sequentially.\nDatabase startup process, while-loop sequential cleanup logic:\nvoid StartupReorderBuffer(void) { DIR\t*logical_dir; struct dirent *logical_de; logical_dir = AllocateDir(\u0026#34;pg_replslot\u0026#34;); while ((logical_de = ReadDir(logical_dir, \u0026#34;pg_replslot\u0026#34;)) != NULL) {\t//Exclude . and .. 
if (strcmp(logical_de-\u0026gt;d_name, \u0026#34;.\u0026#34;) == 0 || strcmp(logical_de-\u0026gt;d_name, \u0026#34;..\u0026#34;) == 0) continue; //Validate slot name /* if it cannot be a slot, skip the directory */ if (!ReplicationSlotValidateName(logical_de-\u0026gt;d_name, DEBUG2)) continue; /* * ok, has to be a surviving logical slot, iterate and delete * everything starting with xid-* */ ReorderBufferCleanupSerializedTXNs(logical_de-\u0026gt;d_name); } FreeDir(logical_dir); } The while loop calls ReorderBufferCleanupSerializedTXNs, and after that, the logic is the same as walsender startup cleanup.\nManual Cleanup via pg_drop_replication_slot # The drop slot cleanup logic is different from the automatic spill file cleanup — it does not call ReorderBufferCleanupSerializedTXNs.\nDrop slot flow:\npg_drop_replication_slot(PG_FUNCTION_ARGS) -\u0026gt; ReplicationSlotDrop(const char *name, bool nowait) -\u0026gt; ReplicationSlotDropAcquired(void) -\u0026gt; ReplicationSlotDropPtr\nReplicationSlotDropPtr\u0026rsquo;s slot cleanup logic is also interesting:\n/* * Permanently drop the replication slot which will be released by the point * this function returns. */ static void ReplicationSlotDropPtr(ReplicationSlot *slot) { char\tpath[MAXPGPATH]; char\ttmppath[MAXPGPATH]; /* * If some other backend ran this code concurrently with us, we might try * to delete a slot with a certain name while someone else was trying to * create a slot with the same name. */ LWLockAcquire(ReplicationSlotAllocationLock, LW_EXCLUSIVE); /* Generate pathnames. */ sprintf(path, \u0026#34;pg_replslot/%s\u0026#34;, NameStr(slot-\u0026gt;data.name)); sprintf(tmppath, \u0026#34;pg_replslot/%s.tmp\u0026#34;, NameStr(slot-\u0026gt;data.name)); /* * Rename the slot directory on disk, so that we\u0026#39;ll no longer recognize * this as a valid slot. Note that if this fails, we\u0026#39;ve got to mark the * slot inactive before bailing out. 
If we\u0026#39;re dropping an ephemeral or a * temporary slot, we better never fail hard as the caller won\u0026#39;t expect * the slot to survive and this might get called during error handling. */ if (rename(path, tmppath) == 0) //rename file { /* * We need to fsync() the directory we just renamed and its parent to * make sure that our changes are on disk in a crash-safe fashion. If * fsync() fails, we can\u0026#39;t be sure whether the changes are on disk or * not. For now, we handle that by panicking; * StartupReplicationSlots() will try to straighten it out after * restart. */ //fsync persistence START_CRIT_SECTION(); fsync_fname(tmppath, true); fsync_fname(\u0026#34;pg_replslot\u0026#34;, true); END_CRIT_SECTION(); } ... /* * If removing the directory fails, the worst thing that will happen is * that the user won\u0026#39;t be able to create a new slot with the same name * until the next server restart. We warn about it, but that\u0026#39;s all. */ if (!rmtree(tmppath, true)) ereport(WARNING, (errmsg(\u0026#34;could not remove directory \\\u0026#34;%s\\\u0026#34;\u0026#34;, tmppath))); /* * We release this at the very end, so that nobody starts trying to create * a slot while we\u0026#39;re still cleaning up the detritus of the old one. */ LWLockRelease(ReplicationSlotAllocationLock); } Drop slot doesn\u0026rsquo;t directly unlink files in the slot directory. Instead, it first renames the slotname/ directory to slotname.tmp/, then unlinks the files inside, and finally removes the slotname.tmp/ directory itself.\nIn this, rmtree also loops to unlink files.\nAccelerated Startup Scheme After Replication Slot Spill # Deleting 10 million spill files is obviously very slow, but directly moving the directory (mv) is extremely fast. 
However, direct mv requires attention to the name after the move and the state file, as well as knowing which source code step the mv bypasses.\nmv Naming Notes # Since it was an abnormal shutdown, the startup process will execute SyncDataDirectory to fsync and stat all data files — this is hard to bypass. After SyncDataDirectory completes, it starts handling replication slots. When handling slots, it calls StartupReorderBuffer() -\u0026gt; ReorderBufferCleanupSerializedTXNs to fully clean up spill files.\nBefore entering cleanup, ReplicationSlotValidateName validates the slot name. We can exploit ReplicationSlotValidateName to trick the startup process into skipping the ReorderBufferCleanupSerializedTXNs process.\nReplicationSlotValidateName rules:\nbool ReplicationSlotValidateName(const char *name, int elevel) { ... for (cp = name; *cp; cp++) { //Key rule here if (!((*cp \u0026gt;= \u0026#39;a\u0026#39; \u0026amp;\u0026amp; *cp \u0026lt;= \u0026#39;z\u0026#39;) || (*cp \u0026gt;= \u0026#39;0\u0026#39; \u0026amp;\u0026amp; *cp \u0026lt;= \u0026#39;9\u0026#39;) || (*cp == \u0026#39;_\u0026#39;))) { ereport(elevel, (errcode(ERRCODE_INVALID_NAME), errmsg(\u0026#34;replication slot name \\\u0026#34;%s\\\u0026#34; contains invalid character\u0026#34;, name), errhint(\u0026#34;Replication slot names may only contain lower case letters, numbers, and the underscore character.\u0026#34;))); return false; } } return true; } Valid slot names only contain a-z, 0-9, _.\nSo when renaming, it\u0026rsquo;s recommended to add a dot .:\nRecommended: slotname.bak, slotname.20241215, etc. Not recommended: slotnamebackup, slotname20241215, slotname_bak, etc. Not recommended: .tmp suffix — slot names with .tmp have special meaning. 
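
The naming advice above can be checked before touching a real data directory. What follows is a minimal shell sketch, not PostgreSQL code: `name_passes_slot_check` is a hypothetical helper that mirrors only the character loop shown above (the elided checks in ReplicationSlotValidateName are not reproduced), and the candidate names are the ones from the list.

```shell
#!/bin/sh
# Force byte-wise character ranges so a-z means exactly the lowercase ASCII letters.
LC_ALL=C
export LC_ALL

# Hypothetical helper: succeeds when every character of $1 is in a-z, 0-9 or _,
# which is the per-character rule from the loop shown above.
name_passes_slot_check() {
    case "$1" in
        *[!a-z0-9_]*) return 1 ;;  # at least one character outside the allowed set
        *) return 0 ;;             # only allowed characters
    esac
}

# Candidate backup directory names taken from the recommendations above.
for name in slotname.bak slotname.20241215 slotname_bak slotname20241215; do
    if name_passes_slot_check "$name"; then
        echo "$name: looks like a slot, startup would unlink its xid* files"
    else
        echo "$name: rejected by the name check, startup skips the directory"
    fi
done
```

A backup name only helps if it lands in the rejected branch: names containing a dot are skipped by StartupReorderBuffer, while `slotname_bak` and `slotname20241215` still pass the check and would go through the slow unlink loop.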
After renaming, you need to create the directory and copy the state file, otherwise the slot will behave strangely on startup (e.g., duplicate slot names, auto-generated slot names, inability to delete slots, downstream unable to start the replication link, etc.).\nRecommended mv operations summarized:\ncd pg_replslot\nmv slotname slotname.bak\nmkdir slotname\ncp slotname.bak/state slotname/\nStartup Time Comparison # Compare startup speed across different source code flows to see if manual mv/rm acceleration is actually meaningful.\nReference source logic principles:\nNormal shutdown: skips fsync and stat.\nAbnormal shutdown: goes through fsync and stat.\nValid mv: rename slot directory to .bak, skip unlink.\nInvalid mv: rename slot directory to _bak, spill files start with xid, goes through unlink.\nSince generating real spill files would be too slow, I manually created fake slot directories and spill files: 50 slots total, 400k spills per slot, 20 million spills total, to test startup time (copying whole directories with cp is much faster than creating the files one at a time with cp or dd).\n# | Test Plan | Startup Time\n1 | Normal shutdown; no fsync/stat, no unlink | 0.1 seconds\n2 | Normal shutdown, invalid mv; no fsync/stat, unlink | 11 min 41 sec\n3 | Abnormal shutdown, valid mv; fsync/stat, no unlink | 4 min 35 sec\n4 | Abnormal shutdown, invalid mv; fsync/stat, unlink | 32 min 2 sec\n5 | Abnormal shutdown, rm (create slot dir, keep state) | 13 min 4 sec\nComparing plans 3 and 5, theoretically in the scenario at hand, a valid mv could achieve startup in about 4 minutes, while rm would take about 13 minutes. (This is a rough comparison; the recovery environment already showed some differences.)\n","date":"Jan 4, 2025","externalUrl":null,"permalink":"/en/2025/01/04/pg-startup-logic-and-spill-caused-slow-startup-analysis/","section":"Posts","summary":"Problem Symptom — Slow Startup # Version: PG 13.2\nDatabase startup was slow. The startup process was reading spill files, and the filenames kept changing. 
Checking the spill files was also very slow — ls -l eventually showed 8 million spill files.\nWhy Tens of Millions of Spill Files? # WAL Segment and LSN Meaning # LSN # LSN is a 64-bit bigint. An LSN actually looks like 42D3B/1732C540 (hex). Before the slash / is the 32-bit logical log number, and after the / are 32 bits split into segment number + block number + intra-block offset. These 4 parts are:\n","title":"PG Startup Logic and Spill-Caused Slow Startup Analysis","type":"posts"},{"content":"DDIA-v2 Chinese edition: https://github.com/Vonng/ddia/tree/v2\nAfter finishing DDIA-v2, I couldn\u0026rsquo;t put it down. Everything data-related is explained with such clarity — why is it like this? What\u0026rsquo;s the current state? What problems does this have? The observations and ideas are incredibly incisive and concise. Even the nautical-chart-style diagrams at the start of each chapter are fascinating.\nNote: This article is essentially a collection of excerpts from the original work, with almost none of my own thoughts or ideas. I\u0026rsquo;ve simply plucked out the parts I love most. 
I skip material I already know well and topics too remote from my work.\nCh1: Trade-offs in Data System Architecture # OLTP \u0026amp; OLAP # The distinction between OLTP and analytics is not always clear-cut, but the following table lists some typical characteristics:\nAttribute | Transactional Systems (OLTP) | Analytical Systems (OLAP)\nPrimary read pattern | Point queries (fetch individual records by key) | Aggregation over a large number of records\nPrimary write pattern | Create, update, and delete individual records | Bulk import (ETL) or event stream\nHuman user example | End users of web/mobile applications | Internal analysts, for decision support\nMachine use example | Check whether an action is authorized | Detect fraud/abuse patterns\nQuery type | Fixed set of queries, predefined by the application | Analysts can issue arbitrary queries\nData representation | Latest state of data (current point in time) | History of events over time\nDataset size | GB, TB | TB, PB\nA data warehouse is a separate database where analysts can query freely without affecting OLTP operations. Data warehouses typically store data in a very different way from OLTP databases, optimized for the query types common in analytics. The process of getting data into the data warehouse is called Extract–Transform–Load (ETL). Some database systems offer Hybrid Transaction/Analytical Processing (HTAP), aiming to enable both OLTP and analytics in a single system without ETL from one system to another. Despite the existence of HTAP, the separation between transactional and analytical systems remains common due to their differing goals and requirements. In particular, it is considered good practice for each business system to have its own database, resulting in hundreds of independent operational databases; on the other hand, an enterprise typically has only one data warehouse, allowing business analysts to combine data from several business systems in a single query. 
A data lake is a centralized data repository that holds any data potentially useful for analysis, sourced from business systems through ETL processes. Unlike a data warehouse, a data lake contains only files and imposes no specific file format or data model. Data warehouses typically use the relational data model and are queried via SQL. A data lakehouse goes beyond a standalone data warehouse by enabling typical data warehouse workloads (SQL queries and business analytics) as well as data science/machine learning workloads to run directly on files in the data lake. This architecture is called a data lakehouse. It requires a query execution engine and a metadata (e.g., schema management) layer to extend the file storage of the data lake. Apache Hive, Spark SQL, Presto, and Trino are examples of this approach.\nCloud Services vs. Self-Hosting # The pros and cons of cloud services: Using cloud services, rather than running comparable software yourself, is essentially outsourcing the operation of that software to a cloud provider. There are strong arguments both for and against using cloud services.\nAdvantages:\nWhen you use the cloud, you still need an operations team, but outsourcing basic system administration can free your team to focus on higher-level problems. Cloud services are especially valuable if your system load varies significantly over time. If you provision machines to handle peak load but those computing resources sit idle most of the time, your system becomes less cost-effective. Compared to physical machines, cloud instances can be provisioned faster and come in a wider variety of sizes. Disadvantages:\nThe biggest drawback of cloud services is that you have no control over them. If you already have experience setting up and operating the required systems and your load is fairly predictable (i.e., the number of machines you need won\u0026rsquo;t fluctuate dramatically), it is typically cheaper to buy your own machines and run the software yourself. 
If the service lacks a feature you need, your only option is to politely ask the vendor whether they\u0026rsquo;ll add it; you usually can\u0026rsquo;t implement it yourself. If the service goes down, you can only wait for it to recover. If you use the service in a way that triggers a bug or causes performance issues, it\u0026rsquo;s very difficult to diagnose the problem. With software you run yourself, you can obtain performance metrics and debugging information from the operating system to understand its behavior, and you can inspect server logs. But with vendor-hosted services, you typically don\u0026rsquo;t have access to this internal information. Moreover, if the service shuts down or becomes unacceptably expensive, or if the vendor decides to change its product in a way you don\u0026rsquo;t like, you\u0026rsquo;re at their mercy: continuing to run an old version of the software is usually not an option, so you\u0026rsquo;ll be forced to migrate to another service. This risk can be mitigated if there are alternative services offering compatible APIs, but for many cloud services, there is no standard API, which increases switching costs and makes vendor lock-in a real problem. Latency-critical applications such as high-frequency trading require complete control over hardware, making the cloud a poor choice for such businesses. Cloud-Native # Category | Self-Hosted Systems | Cloud-Native Systems\nTransactional/OLTP | MySQL, PostgreSQL, MongoDB | AWS Aurora, Azure SQL DB Hyperscale, Google Cloud Spanner\nAnalytical/OLAP | Teradata, ClickHouse, Spark | Snowflake, Google BigQuery, Azure Synapse Analytics\nThe key idea behind cloud-native services is to use not only the computing resources managed by the operating system but also to build on top of lower-level cloud services to create higher-level services. For example:\nObject storage services like Amazon S3, Azure Blob Storage, and Cloudflare R2 store large files. 
They provide a more limited API than a typical filesystem (basic file reads and writes), but their advantage is hiding the underlying physical machines: the service automatically distributes data across many machines, so you don\u0026rsquo;t need to worry about running out of disk space on any single machine. Even if some machines or their disks fail entirely, no data is lost. Many other services are in turn built on top of object storage and other cloud services: for example, Snowflake is a cloud-based analytical database (data warehouse) that relies on S3 for data storage, and some services are further built on top of Snowflake. Cloud-native systems are typically multi-tenant, meaning they don\u0026rsquo;t provision separate machines for each customer. Instead, data and computation from several different customers are handled by the same service on shared hardware. Multi-tenancy enables better hardware utilization, easier scalability, and simpler management for cloud providers.\nOperations in the Cloud Era # Traditionally, the people managing an organization\u0026rsquo;s server-side data infrastructure were called database administrators (DBAs) or system administrators (sysadmins). In recent years, many organizations have attempted to integrate software development and operations roles into a single team jointly responsible for backend services and data infrastructure; the DevOps philosophy has guided this trend. Site Reliability Engineers (SREs) represent Google\u0026rsquo;s implementation of this philosophy.\nThe DevOps/SRE philosophy emphasizes:\nAutomation — preferring repeatable processes over one-off manual tasks, Preferring ephemeral virtual machines and services over long-running servers, Promoting frequent application updates, Learning from incidents, Preserving organizational knowledge about systems even as individual personnel come and go. 
The operations team at an infrastructure company focuses on the details of reliably delivering services to a large number of customers, while the customers of the service spend as little time and energy on infrastructure as possible. Beyond the traditional need for capacity planning, adopting cloud services may be easier and faster than running your own infrastructure. While the cloud is changing the role of operations, the need for operations remains urgent.\nCh2: Defining Non-Functional Requirements # Hardware and Software Faults # In large-scale systems, hardware faults happen frequently enough that they become part of normal system operation:\nAbout 2-5% of disk hard drives fail each year; in a storage cluster with 10,000 disks, we can therefore expect on average one disk failure per day. About 0.5-1% of solid-state drives (SSDs) fail each year. Uncorrectable errors occur about once per drive per year. About one in 1,000 machines has a CPU core that occasionally computes incorrect results. Data in RAM can also be corrupted, due to random events like cosmic rays or permanent physical defects. Additionally, certain pathological memory access patterns can flip bits with high probability. Other hardware components such as power supplies, RAID controllers, and memory modules also fail. An entire data center can become unavailable (e.g., due to power outages or network misconfiguration) or even permanently destroyed (e.g., fire or flood). Software faults are often unpredictable and, because they are correlated across nodes, can cause more system failures than hardware faults:\nA bug that causes all application server instances to crash upon receiving a specific bad input. For example, the leap second on June 30, 2012, caused many applications to hang simultaneously due to a bug in the Linux kernel. A runaway process that exhausts some shared resource — CPU time, memory, disk space, or network bandwidth. 
A service that the system depends on becomes slow, unresponsive, or starts returning incorrect responses. Cascading failures, where a small fault in one component triggers a fault in another, which triggers further faults. Operational configuration errors are the leading cause of service outages, while hardware faults (server or network) account for only 10-25% of service outages.\nScalability Principles # A good general principle for scalability is to decompose the system into small components that can operate relatively independently. This is the basic principle behind microservices. However, the challenge lies in knowing where to draw the line between things that belong together and things that should be separate.\nIf a single-machine database can do the job, it may be preferable to a complex distributed setup. A system with five services is simpler than one with fifty services. Good architecture often involves a mix of approaches.\nOperations # An operations team is critical to keeping software systems running smoothly. The typical responsibilities of a good operations team include (and go beyond) the following:\nMonitoring system health and quickly restoring service when it degrades. Tracking down the causes of problems, such as system failures or performance degradation. Keeping software and platforms up to date, including security patches. Understanding interactions between systems to avoid damaging changes before they cause harm. Anticipating future problems and addressing them before they occur (e.g., capacity planning). Establishing good practices for deployment, configuration, and management, and writing supporting tools. Performing complex maintenance tasks, such as migrating applications from one platform to another. Maintaining system security during configuration changes. Defining workflows to make operations predictable and maintain production environment stability. Preserving organizational knowledge about systems as personnel come and go. 
Good operability means easier day-to-day work, allowing the operations team to focus on high-value tasks. Data systems can make routine tasks easier in various ways:\nProviding good monitoring with visibility into the system\u0026rsquo;s internal state and runtime behavior. Offering good support for automation, integrating the system with standardized tools. Avoiding dependence on a single machine (allowing machines to be taken down for maintenance while the overall system continues running uninterrupted). Providing good documentation and an easy-to-understand operational model (\u0026ldquo;if you do X, Y will happen\u0026rdquo;). Providing good default behavior but also allowing administrators to freely override defaults when needed. Self-healing when possible, but also allowing administrators to manually control system state when needed. Predictable behavior, minimizing surprises. Some aspects of operations can and should be automated, but setting up correctly functioning automation in the first place still depends on humans.\nSystems with too strong an individual stamp cannot succeed. When the initial design is complete and relatively stable, the real testing begins as different people test it in their own ways. — Donald Knuth\nCh3: Data Models and Query Languages # Most applications are built by layering one data model on top of another.\nAs an application developer, you observe the real world (with people, organizations, goods, actions, money flows, sensors, etc.) and model it in terms of objects or data structures and APIs that manipulate those data structures. These structures are typically specific to your application. When you want to store these data structures, you express them in a general-purpose data model, such as JSON or XML documents, tables in a relational database, or vertices and edges in a graph. These data models are the subject of this chapter. 
The engineers who build your database software decided on a way to represent that JSON/relational/graph data as bytes in memory, on disk, or on the network. This representation may allow the data to be queried, searched, manipulated, and processed in various ways. We\u0026rsquo;ll discuss these storage engine designs in a later chapter. At an even lower level, hardware engineers have figured out how to represent bytes in terms of electric currents, light pulses, magnetic fields, and so on. SQL \u0026amp; NoSQL # Databases can execute declarative queries in parallel across multiple CPU cores and machines, without you needing to worry about how to implement that parallelism. Implementing such parallel execution yourself in hand-coded algorithms would be an enormous undertaking.\nThe relational model, despite being half a century old, remains an important data model for many applications — especially in data warehousing and business analytics, where relational star or snowflake schemas and SQL queries are ubiquitous. However, in other domains, several alternatives to relational data have become popular:\nThe document model targets use cases where data comes in the form of self-contained JSON documents and relationships between documents are rare. The graph data model goes in the opposite direction, targeting use cases where anything can be related to everything, and queries may need to traverse multiple hops to find data of interest (this can be expressed using recursive queries in Cypher, SPARQL, or Datalog). The dataframe generalizes relational data into a large number of columns, building a bridge between databases and the multidimensional arrays that form the foundation of most machine learning, statistical data analysis, and scientific computing. 
Databases also tend to expand into adjacent domains by adding support for other data models: for example, relational databases have added support for document data in the form of JSON columns, document databases have added relational-like joins, and support for graph data in SQL is gradually improving.\nCh4: Storage and Indexing # Hash Indexes # Key-value stores are quite similar to the dictionary type found in most programming languages, which is typically implemented using a hash map or hash table.\nGenerally, the hash map of a hash index is kept entirely in memory. Data values can use more space than available memory because the required portion can be loaded from disk with a single disk seek.\nDrawbacks of hash indexes:\nIn principle, a hash map can be maintained on disk. Unfortunately, disk-based hash maps struggle to perform well. They require a large amount of random-access I/O, are expensive to grow when exhausted, and require tedious logic to resolve hash collisions. Range queries are inefficient. For example, you can\u0026rsquo;t easily scan all keys between kitty00000 and kitty99999 — you must look up each key individually in the hash map. B-Tree Indexes # B-tree indexes have been around since 1970 and are widely accepted and used in the industry. This section is familiar to most readers — skipped.\nSSTables \u0026amp; LSM Trees # In hash indexes, the order of key-value pairs doesn\u0026rsquo;t matter. But we can require that the sequence of key-value pairs be sorted by key. This format is called a Sorted String Table, or SSTable.\nCompared to log segments using hash indexes, SSTables have several major advantages:\nEven if the file is larger than available memory, merging segments remains simple and efficient. The approach is like the one used in merge sort algorithms: you start reading multiple input files side by side, look at the first key in each file, copy the lowest key (according to the sort order) to the output file, and repeat. 
This produces a new merged segment file, also sorted by key. To find a particular key in the file, you no longer need to keep an index of all keys in memory. You still need an in-memory index to tell you the offsets for some of the keys, but it can be sparse: one key per several kilobytes of segment file is sufficient, because several kilobytes can be scanned very quickly. Using these data structures, you can insert keys in any order and read them back in sorted order. Now we can make our storage engine work as follows:\nWhen a new write comes in, add it to an in-memory balanced tree data structure (e.g., a red-black tree). This in-memory tree is sometimes called a memtable. When the memtable becomes larger than some threshold (typically a few megabytes), write it out to disk as an SSTable file. This can be done efficiently because the tree already maintains key-value pairs sorted by key. The new SSTable file becomes the most recent segment of the database. While that SSTable is being written to disk, new writes can continue on a new memtable instance. When a read request comes in, first try to find the key in the memtable, then in the most recent on-disk segment, then in the next older segment, and so on. From time to time, run a merging and compaction process in the background to combine segment files and discard overwritten or deleted values. The algorithm described here is essentially the technique used by LevelDB and RocksDB, key-value storage engine libraries designed to be embedded in other applications. Similar storage engines are used in Cassandra and HBase, and all of them were inspired by Google\u0026rsquo;s Bigtable paper (which introduced the terms SSTable and memtable).\nIn-Memory Databases # In-memory databases: As RAM becomes cheaper, the argument that RAM costs more per GB is eroding. Many datasets are not that large, so keeping them entirely in memory is quite feasible, including potentially distributed across multiple machines. 
This has led to the development of in-memory databases. Losing data when restarting a computer may be acceptable. Durability can also be achieved through special hardware (e.g., battery-backed RAM), by writing a change log to disk, by periodically writing snapshots to disk, or by replicating the in-memory state to other machines.\nThe typical in-memory database Redis provides weak durability through asynchronous writes to disk. Other in-memory databases include Memcached, VoltDB, MemSQL, Oracle TimesTen, and RAMCloud.\nCounterintuitively, the performance advantage of in-memory databases does not come from avoiding disk reads. Instead, they are faster because they avoid the overhead of encoding in-memory data structures into on-disk data structures.\nMaterialized Views and OLAP # Think of SQL functions like COUNT, SUM, AVG, MIN, or MAX. If the same aggregations are used by many different queries, it may be wasteful to process the raw data each time. Why not cache some of the most frequently used counts or sums? One way to create such a cache is a Materialized View.\nWhen the underlying data changes, a materialized view needs to be updated because it is a denormalized copy of the data. The database can do this automatically, but such updates make writes more expensive, which is why materialized views are not commonly used in OLTP databases. In read-heavy data warehouses, they may make more sense because warehouses don\u0026rsquo;t have many small, frequent updates.\nThe advantage of a materialized data cube is that it can make certain queries extremely fast because they have already been effectively precomputed. For example, if you want to know the total sales per store, you just look at the total along the appropriate dimension without scanning millions of rows of raw data.\nThe disadvantage of a data cube is that it lacks the flexibility of querying raw data. 
For example, there is no way to compute what proportion of sales came from items costing over $100, because price is not one of the dimensions. Therefore, most data warehouses try to keep as much raw data as possible and use aggregate data (like data cubes) only as a performance boost for certain queries.\nColumn-Oriented Storage # The idea behind column-oriented storage is simple: instead of storing all the values from one row together, store all the values from each column together. Column-oriented storage is easiest to understand in the relational data model, but it applies equally to non-relational data. For example, Parquet is a column-oriented storage format that supports a document data model based on Google\u0026rsquo;s Dremel.\nThese optimizations (column compression, sorting, etc.) make sense in data warehouses, where the workload consists mainly of large read-only queries run by analysts. Column-oriented storage, compression, and sorting all help read those queries faster. However, their drawback is that writes become more difficult.\nCh5: Encoding and Evolution # REST vs. RPC # Servers expose APIs over the network, and clients can connect to servers to make requests to those APIs. The API exposed by a server is called a service. Download data via GET requests, submit data to the server via POST requests.\nWhen a service uses HTTP as the underlying communication protocol, it can be called a web service. There are two popular approaches to web services: REST and SOAP. REST is not a protocol but a design philosophy based on HTTP principles. APIs designed according to REST principles are called RESTful.\nRemote Procedure Calls (RPC) are very different from local function calls:\nLocal function calls are predictable and succeed or fail based only on parameters under your control. Network requests are unpredictable: requests or responses may be lost due to network problems, or the remote machine may be slow or unavailable. 
A local function call either returns a result, throws an exception, or never returns (because it enters an infinite loop or the process crashes). A network request has another possible outcome: it may return with no result due to a timeout. And so on. REST seems to be the dominant style for public APIs, while RPC frameworks mainly focus on requests between services owned by the same organization, typically within the same data center.\nCh6: Replication # Replication logs, failover, single-leader mode — the content is relatively straightforward. Skipped.\nMulti-Leader Replication # Multi-leader replication is often a retrofitted feature in many databases, so it frequently has subtle configuration pitfalls and often interacts unexpectedly with other database features. For example, auto-increment primary keys, triggers, and integrity constraints can all cause trouble. Therefore, multi-leader replication is often considered dangerous territory and should be avoided whenever possible.\nHowever, multi-leader replication does have certain advantages, such as distributing write I/O, disaster recovery, and reducing network overhead in multi-region deployments (local writes), etc.\nWrite conflicts: The biggest problem with multi-leader replication is the potential for write conflicts, and resolving them is quite tricky. In principle, conflict detection could be made synchronous — i.e., wait for writes to be replicated to all replicas before telling the user the write succeeded. But this may defeat the purpose of multi-leader: if you want synchronous conflict detection, you might as well just use single-leader replication.\nResolving multi-leader write conflicts:\nAvoid conflicts. For example, have the application control that users only edit their own data. Converge to consistency: Last Write Wins (LWW). Write by timestamp — may result in data loss. Priority writes. Higher-priority writes win — may result in data loss. Extra code. 
Preserve conflict information and write custom conflict resolution code. Real-time collaborative editing applications allow multiple people to edit a document simultaneously — Etherpad and Google Docs are mature examples. Databases are still very young in the area of multi-leader writes.\nMulti-leader write conflicts in databases are mostly resolved or avoided at the application level. The following are relatively mature areas of write conflict research for reference:\nConflict-free Replicated Data Types (CRDTs) are data structures such as sets, maps, ordered lists, and counters that can be concurrently edited by multiple users and resolve conflicts automatically in a reasonable way. Some CRDTs have been implemented in Riak 2.0. Mergeable Persistent Data Structures explicitly track history, similar to the Git version control system, and use three-way merge functions (whereas CRDTs use two-way merges). Operational Transformation (OT) is the conflict resolution algorithm behind collaborative editing applications like Etherpad and Google Docs. It is designed specifically for concurrent editing of ordered lists, such as lists of characters that make up a text document. Ch7: Partitioning # Range Partitioning and Hash Partitioning # The drawback of range partitioning is that certain access patterns can lead to hot spots. If the primary key is a timestamp, partitions correspond to time ranges, and all writes will go to the same partition (i.e., today\u0026rsquo;s partition), which may become overloaded with writes while other partitions sit idle. You can use something other than the timestamp as the first part of the primary key to scatter the hot spot, but the drawback is that range queries won\u0026rsquo;t benefit.\nHash partitioning can mitigate the risk of skew and hot spots. For the purpose of partitioning, the hash function doesn\u0026rsquo;t need to be a cryptographically strong algorithm. 
The drawback of hash partitioning is that by partitioning by key hash, we lose a great property of key-range partitioning: the ability to efficiently execute range queries.\nHash partitioning can help reduce hot spots. But it cannot eliminate them entirely. For example, on a social media site, a celebrity user with millions of followers doing something can trigger a storm. This event can cause a large number of writes to the same key (the key might be the celebrity\u0026rsquo;s user ID or the ID of the action being commented on). Hash strategies don\u0026rsquo;t help here, because the hash of two identical IDs is still the same. If a primary key is very hot, a simple workaround is to add a random number at the beginning or end of the primary key. Just a two-digit decimal random number can scatter the primary key into 100 different primary keys, thus stored in different partitions. In any case, it\u0026rsquo;s about scattering hot spots, and you need to consider side effects such as the impact on range queries.\nCh8: Transactions # ACID, BASE # ACID is actually a very old definition. Due to the later discovery of many \u0026ldquo;anomalies,\u0026rdquo; a system claiming to guarantee ACID can\u0026rsquo;t actually articulate what exactly it guarantees. Whatever the case, ACID remains deeply ingrained — it represents the most fundamental principles of transactions. Conversely, systems that don\u0026rsquo;t meet the ACID criteria are sometimes called BASE, which stands for Basically Available, Soft State, and Eventual Consistency. BASE is a concept commonly mentioned in the NoSQL world.\nThe definition of BASE is even fuzzier than ACID. 
A simple, easy-to-understand, easy-to-remember theory of BASE: BASE (which means \u0026ldquo;alkali\u0026rdquo; in chemistry) is the opposite of ACID (which means \u0026ldquo;acid\u0026rdquo;).\nYou can think of it simply this way:\nRelational databases vs. non-relational databases: transactions vs. no transactions; ACID vs. BASE; SQL vs. NoSQL.\nAtomicity and isolation within ACID are relatively easy to understand. The concept of consistency is actually quite vague and doesn\u0026rsquo;t seem closely related to the database itself. The book has a classic quote on this:\nJoe Hellerstein pointed out that in Härder and Reuter\u0026rsquo;s paper, \u0026ldquo;the C in ACID\u0026rdquo; was \u0026ldquo;tossed in to make the acronym work,\u0026rdquo; and at the time, nobody cared much about consistency.\nThe definition of isolation is also very fuzzy, and the industrial practice of serializability has been stagnant. Transaction isolation can be described as \u0026ldquo;a mess,\u0026rdquo; but if serializability is a panacea, why does no one use it? Refer to this article: The History of Transactions and SSI\nAnomalies in non-serializable isolation levels generally only manifest under high concurrency; databases with low concurrency rarely encounter problems. When anomalies do occur, some applications may not notice them or may detect them but find them unimportant. Data may be anomalous, but the application may simply return an error and enter an anomaly-handling routine. The cost is too high. Not only is serializable isolation expensive for databases to develop, but applications also incur adaptation costs to use it. Just understanding this complex theory is no easy task. Higher isolation levels cost some performance. Massive rework may be thankless; applications need to choose between \u0026ldquo;high concurrency\u0026rdquo; and \u0026ldquo;no anomalies.\u0026rdquo; Businesses develop based on mechanisms, not rules. 
Businesses have somewhat adapted to the anomalies of weaker isolation levels, especially Read Committed. Summed up in one sentence: It\u0026rsquo;s not like it\u0026rsquo;s unusable!\nPessimistic and Optimistic Transaction Models # Two-phase locking is a so-called pessimistic concurrency control mechanism: it is based on the principle that if something might go wrong (e.g., another transaction holding a lock), it\u0026rsquo;s better to wait until the situation is safe before proceeding. It\u0026rsquo;s like a mutex used to protect data structures in multi-threaded programming.\nIn a sense, serial execution could be called the ultimate in pessimism: for the duration of each transaction, each transaction holds an exclusive lock on the entire database (or a partition of the database). As compensation for the pessimism, we make each transaction execute very fast, so the \u0026ldquo;lock\u0026rdquo; is only held for a short time.\nIn contrast, Serializable Snapshot Isolation is an optimistic concurrency control technique. In this context, optimistic means that if there is potential danger, the transaction is not blocked — instead, it continues executing, hoping everything will turn out fine. When a transaction wants to commit, the database checks whether anything bad happened (i.e., whether isolation was violated); if so, the transaction is aborted and must be retried. Only serializable transactions are allowed to commit. If there is a lot of contention (i.e., many transactions trying to access the same objects), performance suffers because a large proportion of transactions need to be aborted. If the system is already near maximum throughput, the additional load from retried transactions can worsen performance.\nCh9: Distributed Systems # Clocks # Clocks are critically important in distributed systems — they can directly affect the visibility, isolation, and correctness of transactions. 
In reality, reading a precise point in time is meaningless (from a quantum theory perspective, there is no concept of an absolute point in time; the actual situation is even more complex). Spanner\u0026rsquo;s Google TrueTime API reports a confidence interval for the local clock. The confidence interval reports an extremely short and trustworthy time range rather than a time point. For example, if you have two confidence intervals, each containing the earliest and latest possible timestamps ($A = [A_{earliest}, A_{latest}]$, $B=[B_{earliest}, B_{latest}]$), and these two intervals do not overlap (i.e., $A_{earliest} \u0026lt; A_{latest} \u0026lt; B_{earliest} \u0026lt; B_{latest}$), then B definitely happened after A — there is no doubt. Only when the intervals overlap are we uncertain about the order in which A and B occurred. To ensure that transaction timestamps reflect causality, Spanner deliberately waits for the length of the confidence interval before committing a read-write transaction. To keep the clock uncertainty as small as possible, Google deploys a GPS receiver or atomic clock in every data center, allowing clocks to be synchronized to within about 7 milliseconds. Logical clocks are based on incrementing counters rather than oscillating quartz crystals. Logical clocks only measure the relative ordering of events.\nReal time may not exist. Responsiveness trumps everything. For most server-side data processing systems, real-time guarantees are uneconomical or unsuitable. 
Therefore, these systems must endure pauses and clock instability in non-real-time environments.\nCh10: Consistency and Consensus # All the faults we have assumed so far are possible: packets in the network can be lost, reordered, duplicated, or arbitrarily delayed; clocks are at best approximate; and nodes can pause (e.g., due to garbage collection) or crash at any time.\nCAP # The formal definition of the CAP theorem is limited to a very narrow scope: it only considers one consistency model (linearizability) and one type of fault (network partitions, or nodes that are alive but disconnected from each other). It doesn\u0026rsquo;t say anything about network delays, dead nodes, or other trade-offs. Therefore, despite CAP\u0026rsquo;s historical influence, it has little practical value for designing systems.\nDistributed Transactions and Consensus # All the consensus protocols discussed so far internally use a leader in some form, but they don\u0026rsquo;t guarantee that the leader is unique. Instead, they make a weaker guarantee: the protocol defines an epoch number (called ballot number in Paxos, view number in Viewstamped Replication, and term number in Raft) and ensures that within each epoch, the leader is unique. Whenever the current leader is thought to be dead, a vote begins among the nodes to elect a new leader. This election is assigned an incrementing epoch number, so epoch numbers are totally ordered and monotonically increasing. If there is a conflict between leaders from two different epochs (perhaps because the previous leader hadn\u0026rsquo;t actually died), the leader with the higher epoch number prevails. Designing algorithms that robustly cope with unreliable networks remains an open research problem.\nCh11: Batch Processing # Services (online systems) Services wait for requests or instructions from clients to arrive. Upon receiving one, the service attempts to process it as quickly as possible and sends back a response. 
Response time is typically the primary performance metric for services, and availability is usually very important (if clients can\u0026rsquo;t reach the service, users may receive error messages).\nBatch processing systems (offline systems) A batch processing system takes a large amount of input data, runs a job to process it, and produces some output data. This often takes a while (from minutes to days), so typically no user is waiting for the job to finish. Instead, batch jobs typically run periodically (e.g., once a day). The primary performance metric for batch jobs is typically throughput (the time needed to process input of a certain size). This chapter discusses batch processing.\nStream processing systems (near-real-time systems) Stream processing sits between online and offline (batch) processing, so it is sometimes called near-real-time or nearline processing. Like batch processing systems, stream processing consumes inputs and produces outputs (without needing to respond to requests). However, stream jobs operate on events shortly after they occur, whereas batch jobs wait for a fixed set of input data. This difference gives stream processing systems lower latency compared to batch processing systems.\nThe batch processing algorithm MapReduce, published in 2004, was (perhaps over-enthusiastically) called \u0026ldquo;the algorithm that made Google\u0026rsquo;s massive scalability possible.\u0026rdquo; MapReduce is a fairly low-level programming model.\nMapReduce and Distributed File Systems # Compared to the query optimizer of a relational database, Unix tools, despite their simplicity, are still remarkably useful. The biggest limitation of Unix tools is that they can only run on a single machine — this is where tools like Hadoop came in. MapReduce is somewhat like Unix tools but distributed across thousands of machines. Like Unix tools, it\u0026rsquo;s fairly crude but surprisingly effective. 
MapReduce jobs read and write files on a distributed file system. In Hadoop\u0026rsquo;s implementation of MapReduce, this file system is called HDFS (Hadoop Distributed File System), an open-source implementation of the Google File System (GFS). Besides HDFS, there are various other distributed file systems such as GlusterFS and the Quantcast File System (QFS). Object storage services like Amazon S3, Azure Blob Storage, and OpenStack Swift are similar in many ways.\nTo create a MapReduce job, you need to implement two callback functions, Mapper and Reducer, which behave as follows:\nMapper The Mapper is called once on each input record. Its job is to extract key-value pairs from the input record. For each input, it can generate any number of key-value pairs (including none). It does not retain any state from one input record to the next, so each record is processed independently.\nReducer The MapReduce framework takes the key-value pairs produced by the Mapper, collects all values belonging to the same key, and iteratively calls the Reducer over this set of values. The Reducer can produce output records (e.g., the count of occurrences of the same URL).\nUsing the MapReduce programming model, the physical network communication aspects of computation (getting data from the right machines) are separated from the application logic (processing the data after obtaining it). This separation contrasts sharply with the typical use of databases, where requests to fetch data from the database frequently appear within application code. Because MapReduce handles all network communication, it also frees application code from worrying about partial failures, such as the crash of another node: MapReduce can transparently retry failed tasks without affecting application logic.\nAnother common pattern of \u0026ldquo;putting related data together\u0026rdquo; is grouping records by some key (like the GROUP BY clause in SQL). 
The simplest way to implement this grouping operation with MapReduce is to set up the Mapper so that the key-value pairs it generates use the desired grouping key. The partitioning and sorting process then directs all records with the same partition key to the same Reducer.\nHadoop vs. Distributed Databases # As we\u0026rsquo;ve seen, Hadoop is somewhat like a distributed version of Unix, where HDFS is the file system and MapReduce is a peculiar implementation of Unix processes (always running the sort utility between the Map and Reduce phases). We\u0026rsquo;ve seen how various join and grouping operations can be implemented on top of these primitives.\nWhen the MapReduce paper was published, it was — in a sense — not new. All the processing and parallel join algorithms we discussed in earlier sections had already been implemented over a decade earlier in so-called massively parallel processing (MPP) databases. Examples include the Gamma database machine, Teradata, and Tandem NonStop SQL, which were pioneers in this area.\nThe biggest difference is that MPP databases focus on executing analytical SQL queries in parallel across a set of machines, whereas the combination of MapReduce and a distributed file system is more like a general-purpose operating system that can run arbitrary programs.\nDiversity of Processing Models # Having only two processing models, SQL and MapReduce, is not enough — more diverse models are needed! And due to the openness of the Hadoop platform, implementing a whole range of approaches is feasible, something that was impossible within the monolithic MPP database paradigm. Traditionally, MPP databases met the needs of business intelligence analytics and business reporting, but this is only one of many domains that use batch processing. 
In the years since MapReduce became popular, execution engines for distributed batch processing have matured significantly.\nCh12: Stream Processing # Skipped.\nEvent Sourcing # Event sourcing is a powerful data modeling technique: from the application\u0026rsquo;s perspective, it\u0026rsquo;s more meaningful to record user actions as immutable events rather than recording the effects of those actions in a mutable database. Event sourcing is similar to the chronicle data model. Like change data capture, event sourcing involves storing all changes to application state as a log of change events. Applications using event sourcing need to pull the event log (representing the data written to the system) and transform it into application state suitable for display to users. The current state is derived from the event log.\nCh13: The Future of Data Systems # Lambda Architecture # If batch processing is used to reprocess historical data and stream processing is used for recent updates, how do we combine the two? The Lambda Architecture is one proposal for this. The core idea of the Lambda Architecture is to record incoming data by appending immutable events to an ever-growing dataset, similar to event sourcing. In the Lambda approach, the stream processor consumes events and quickly produces an approximate update to the view; the batch processor later uses the same set of events and produces a corrected version of the derived view.\nUnix evolved pipelines and files that are just byte sequences, while databases evolved SQL and transactions. Which approach is better? Of course, it depends on what you want. Unix is \u0026ldquo;simple\u0026rdquo; because it\u0026rsquo;s a fairly thin wrapper around hardware resources; relational databases are \u0026ldquo;simpler\u0026rdquo; because a short declarative query can leverage a lot of powerful infrastructure (query optimization, indexes, join methods, concurrency control, replication, etc.) 
without requiring the query author to understand the implementation details. I interpret the NoSQL movement as a desire to apply Unix-like low-level abstractions to the domain of distributed OLTP data storage.\nSeparation of Application Code and State # In theory, a database could be a deployment environment for arbitrary application code, much like an operating system. In practice, however, they are poorly suited to this goal. They don\u0026rsquo;t meet the requirements of modern application development, such as dependency and package management, version control, rolling upgrades, evolvability, monitoring, metrics, calls to network services, and integration with external systems. I believe it makes sense to have some parts of the system specialized for persistent data storage and other parts specialized for running application code. The two can interact while remaining independent. The trend is to separate stateless application logic from state management (databases): don\u0026rsquo;t put application logic into the database, and don\u0026rsquo;t put persistent state into the application.\nI assert that in most applications, integrity is far more important than timeliness. Violating timeliness may be confusing and annoying, but violating integrity can be catastrophic.\nProblems Introduced by Algorithms # Bias and discrimination: For example, in racially segregated areas, a person\u0026rsquo;s ZIP code, or even their IP address, is a strong indicator of race. Given this, it seems absurd to believe that an algorithm can somehow take biased data as input and produce fair and unbiased output. 
Yet this view often seems to lurk among advocates of data-driven decision-making — an attitude satirized as \u0026ldquo;machine learning is like money laundering for bias.\u0026rdquo; Predictive analytics systems simply extrapolate from the past; if the past was discriminatory, they codify that discrimination into rules.\nResponsibility and accountability: Automated decision-making raises questions about responsibility and accountability. If a person makes a mistake, they can be held accountable, and those affected by the decision can appeal. Algorithms also make mistakes, but when they do, who is responsible?\nPrivacy and surveillance: Let\u0026rsquo;s do a thought experiment. Try replacing the word data with surveillance and see if common phrases still sound as nice. For example: \u0026ldquo;In our surveillance-driven organization, we collect real-time surveillance streams and store them in our surveillance warehouse. Our surveillance scientists use advanced analytics and surveillance processing to gain new insights.\u0026rdquo;\nBlind faith in the supremacy of data-driven decisions is not just delusional — it\u0026rsquo;s genuinely dangerous. As data-driven decision-making becomes more prevalent, we need to figure out how to make algorithms more accountable and transparent, how to avoid reinforcing existing biases, and how to fix them when they inevitably err.\nUsers barely know what data they\u0026rsquo;re giving us, what data goes into the database, and how the data is retained and processed — most privacy policies are ambiguous, stringing users along without coming clean. If users don\u0026rsquo;t understand what will happen to their data, they can\u0026rsquo;t give any meaningful consent. For users who disagree with surveillance, the only truly viable alternative is simply not to use the service. 
But this choice isn\u0026rsquo;t truly free either: if a service is so popular that it is \u0026ldquo;considered a necessity for basic social participation by most,\u0026rdquo; then expecting people to opt out is unreasonable — using it is effectively mandatory.\nSummary # Since software and data have such an enormous impact on the world, we engineers must remember that we have a responsibility to work toward the kind of world we want: a world that respects people, that respects humanity. I hope we can work together toward that goal.\n","date":"Sep 20, 2024","externalUrl":null,"permalink":"/en/2024/09/20/book-notes-designing-data-intensive-applications-2nd-edition/","section":"Posts","summary":"DDIA-v2 Chinese edition: https://github.com/Vonng/ddia/tree/v2\nAfter finishing DDIA-v2, I couldn’t put it down. Everything data-related is explained with such clarity — why is it like this? What’s the current state? What problems does this have? The observations and ideas are incredibly incisive and concise. Even the nautical-chart-style diagrams at the start of each chapter are fascinating.\nNote: This article is essentially a collection of excerpts from the original work, with almost none of my own thoughts or ideas. I’ve simply plucked out the parts I love most. Some knowledge I’ve already mastered and some topics too remote are skipped!\n","title":"Book Notes — Designing Data-Intensive Applications (2nd Edition)","type":"posts"},{"content":"","date":"Sep 20, 2024","externalUrl":null,"permalink":"/en/categories/%E8%AF%BB%E4%B9%A6%E7%AC%94%E8%AE%B0/","section":"Categories","summary":"","title":"读书笔记","type":"categories"},{"content":"Among all relational databases, PostgreSQL\u0026rsquo;s CLOG is a very special type of log. CLOG\u0026rsquo;s existence is inseparable from PostgreSQL\u0026rsquo;s MVCC mechanism. Some basic knowledge about transaction IDs and CLOG won\u0026rsquo;t be covered in this article. If interested, please refer to CLOG and Hint Bits. 
This article focuses on the structure of CLOG files, manually locating transaction states, and the CLOG WAL log synchronization mechanism, to further understand PostgreSQL\u0026rsquo;s CLOG.\nCLOG Segment # CLOG Directory # To distinguish from regular logs, PostgreSQL 10 renamed the CLOG and WAL directories 1:\npg9.6 pg10 pg_clog pg_xact pg_xlog pg_wal Don\u0026rsquo;t get confused — I was also troubled by pg_xlog and pg_xact for a while\u0026hellip;\nCLOG Segment Name # CLOG is also managed by SLRU, and CLOG file naming is also in slru.c:\n#define SlruFileName(ctl, path, seg) \\ snprintf(path, MAXPGPATH, \u0026#34;%s/%04X\u0026#34;, (ctl)-\u0026gt;Dir, seg) %04X means hexadecimal (X), width of 4, zero-padded on the left (04). Example CLOG filenames:\n[pg_xact]$ ll -rw------- 1 postgres postgres 262144 Aug 15 16:29 03C0 -rw------- 1 postgres postgres 262144 Aug 19 23:04 03C1 ... TransactionID and CLOG Location Conversion # CLOG only stores transaction ID status, not the transaction ID itself. Through the TransactionID itself, you can directly locate the CLOG file and the position within the file. Before that, we need to understand some fundamentals.\nTransaction States Stored in CLOG # There are only 4 transaction states:\ntypedef int XidStatus; #define TRANSACTION_STATUS_IN_PROGRESS\t0x00 #define TRANSACTION_STATUS_COMMITTED\t0x01 #define TRANSACTION_STATUS_ABORTED\t0x02 #define TRANSACTION_STATUS_SUB_COMMITTED\t0x03 Transaction states are only: in progress, committed, aborted, subtransaction committed. Note that transaction IDs don\u0026rsquo;t have an \u0026ldquo;not started\u0026rdquo; state — as soon as a transaction ID is allocated in the database, that transaction has definitely already started. Conversely, transaction IDs not yet allocated in the database (actually a few — see the extend CLOG section below) correspond to in_progress status in CLOG. Four transaction states actually only need 2 bits to store. 
So 1 byte (8 bits) can store 4 transaction states, and 1 page (8k) can hold 8KB*4=32768 transaction states. These are all defined in the source code:\n* Defines for CLOG page sizes. A page is the same BLCKSZ as is used * everywhere else in Postgres. // CLOG page size = BLCKSZ = 8k (default) #define CLOG_BITS_PER_XACT\t2 // One transaction state occupies 2 bits #define CLOG_XACTS_PER_BYTE 4 // 1 byte can hold 4 transaction states #define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE) // 1 page can hold 32768 transaction states, 8KB*4=32768 #define CLOG_XACT_BITMASK ((1 \u0026lt;\u0026lt; CLOG_BITS_PER_XACT) - 1) // Transaction status bitmask = ((1\u0026lt;\u0026lt;2)-1) = 3, expressed in binary as 11 #define SLRU_PAGES_PER_SEGMENT\t32 // 1 segment has 32 pages Summary:\n1 CLOG segment has 32 pages 1 CLOG page is 8k (typically) 1 byte has 4 transaction states 1 transaction state occupies 2 bits CLOG Segment/Page/Byte Conversion # Finding which CLOG segment a transaction ID corresponds to is not easy — it\u0026rsquo;s hidden in the comments:\n* Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF, * CLOG page numbering also wraps around at 0xFFFFFFFF/CLOG_XACTS_PER_PAGE, * and CLOG segment numbering at * 0xFFFFFFFF/CLOG_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT // segment number = xid/CLOG_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT = xid/32768/32 // Which CLOG segment the transaction ID corresponds to, xid/32768/32, needs to be converted to hex Mapping transaction ID to page, byte, etc. 
is clearer 2:\n#define TransactionIdToPage(xid)\t((xid) / (TransactionId) CLOG_XACTS_PER_PAGE) // Which CLOG page the transaction ID corresponds to, xid/32768 #define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) CLOG_XACTS_PER_PAGE) // The offset within the above page, xid%32768 #define TransactionIdToByte(xid)\t(TransactionIdToPgIndex(xid) / CLOG_XACTS_PER_BYTE) // Which byte in the page the transaction ID corresponds to, (xid%32768)/4 #define TransactionIdToBIndex(xid)\t((xid) % (TransactionId) CLOG_XACTS_PER_BYTE)\t// Which bit index in the above byte (note: bit index, not the bit itself), xid%4 Generally (with 8k BLCKSZ), 1 CLOG segment has 32 pages; 1 CLOG segment has 32*8k bytes, i.e., the CLOG file size is fixed at 256K; 1 CLOG segment can hold 4*32*8k = 1048576 transaction states.\n[pg_xact]$ ll # 256k CLOG segment -rw------- 1 postgres postgres 262144 Aug 15 16:29 03C0 -rw------- 1 postgres postgres 262144 Aug 19 23:04 03C1 ... CLOG Bit Conversion # The functions for setting and getting CLOG bits (TransactionIdSetStatusBit and TransactionIdGetStatus respectively) both use the following code to find which two bits in the CLOG the transaction ID corresponds to:\nint\tbshift = TransactionIdToBIndex(xid) * CLOG_BITS_PER_XACT; char\t*byteptr; ... byteptr = XactCtl-\u0026gt;shared-\u0026gt;page_buffer[slotno] + byteno; curval = (*byteptr \u0026gt;\u0026gt; bshift) \u0026amp; CLOG_XACT_BITMASK; bshift is the number of bits to shift right, where TransactionIdToBIndex=xid%4, CLOG_BITS_PER_XACT=2, CLOG_XACT_BITMASK=3 (binary: 11). 
The key code for getting CLOG bits curval = (*byteptr \u0026gt;\u0026gt; bshift) \u0026amp; CLOG_XACT_BITMASK can be understood in two parts:\n*byteptr \u0026gt;\u0026gt; bshift means right-shifting the byte that byteptr points to (not the pointer itself) by 0, 2, 4, or 6 bits \u0026amp; CLOG_XACT_BITMASK is simply taking the last two bits after the right shift (00\u0026amp;11=00, 01\u0026amp;11=01, 10\u0026amp;11=10, 11\u0026amp;11=11) So, calculating the position of a transaction ID\u0026rsquo;s state within a byte (counting bits from the left, most significant bit first):\nxid%4=0: takes bits 7 and 8 xid%4=1: takes bits 5 and 6 xid%4=2: takes bits 3 and 4 xid%4=3: takes bits 1 and 2 Note: within a byte, transaction states are laid out in reverse order, starting from the least significant bits. Byte and page positions are taken in sequential increasing order.\nManually Calculating Transaction ID Position in CLOG File # If we want to manually locate a transaction in CLOG using hexdump, we need to calculate three elements: \u0026lt;CLOG segment number, offset within segment in bytes, offset on byte in bit index\u0026gt;. 
(This references the approach in \u0026ldquo;PostgreSQL Database Kernel Analysis\u0026rdquo; but with some differences 3)\nBefore calculating, you also need to understand:\nCLOG segment file numbers are in hexadecimal hexdump is in hexadecimal, each line holds 16 bytes, i.e., each line holds 16*CLOG_XACTS_PER_BYTE=16*4=64 transaction states hexdump -s xxx is in byte units The following SQL can calculate the position of a transaction ID in CLOG:\n-- CLOG segment number -- %4294967296 represents transaction ID wraparound, /(8192*4*32) represents the maximum number of transactions a segment file can contain, to_hex converts to hex for filename, lpad left-pads to 4 digits select lpad(upper(to_hex(txid_current()%4294967296/(8192*4*32))),4,\u0026#39;0\u0026#39;) as clog_segmentno; -- Offset within segment in bytes -- %4294967296 represents transaction ID wraparound, %(8192*32*4) takes the remaining transaction IDs, /4 converts to byte units select txid_current()%4294967296%(8192*32*4)/4 as in_clog_offset_bytes; -- Offset on byte in bit index -- %4294967296 represents transaction ID wraparound, %4 takes the bit index within the byte select txid_current()%4294967296%4 as in_byte_offset_bitindex; -- Or a single SQL select lpad(upper(to_hex(txid_current()%4294967296/(8192*4*32))),4,\u0026#39;0\u0026#39;) as clog_segmentno,txid_current()%4294967296%(8192*32*4)/4 as in_clog_offset_bytes,txid_current()%4294967296%4 as in_byte_offset_bitindex; Practical simulation — computing a transaction ID\u0026rsquo;s state in CLOG:\nbegin; select lpad(upper(to_hex(txid_current()%4294967296/(8192*4*32))),4,\u0026#39;0\u0026#39;) as clog_segmentno,txid_current()%4294967296%(8192*32*4)/4 as in_clog_offset_bytes,txid_current()%4294967296%4 as in_byte_offset_bitindex; clog_segmentno | in_clog_offset_bytes | in_byte_offset_bitindex ----------------+----------------------+------------------------- 0002 | 63196 | 3 rollback; checkpoint; Rollback is used to roll back the transaction, mainly for 
easier observation, since most transactions are committed. Checkpoint is to ensure the CLOG page is flushed — otherwise the CLOG page might still be in the CLOG buffer and not yet written to the CLOG segment file.\ncd pg_xact/ hexdump -C 0002 -s 63196 -n 1 -v 0000f6dc 95 |.| 0000f6dd -- Convert hex to binary \u0026gt; select \u0026#39;x96\u0026#39;::bit(8); bit ---------- 10010110 When xid%4=3, take bits 1 and 2. So the bit value for this rolled-back transaction is 10, where 10 represents TRANSACTION_STATUS_ABORTED.\nWhy CLOG Usually Contains Many 55s and U\u0026rsquo;s? # In a typical transactional database CLOG file, a direct hexdump looks like this:\nhexdump -C 0001 -v|head -10 00000000 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 |UUUUUUUUUUUUUUUU| 00000010 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 |UUUUUUUUUUUUUUUU| 00000020 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 |UUUUUUUUUUUUUUUU| 00000030 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 |UUUUUUUUUUUUUUUU| 00000040 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 |UUUUUUUUUUUUUUUU| 00000050 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 |UUUUUUUUUUUUUUUU| 00000060 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 |UUUUUUUUUUUUUUUU| 00000070 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 |UUUUUUUUUUUUUUUU| 00000080 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 |UUUUUUUUUUUUUUUU| 00000090 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 |UUUUUUUUUUUUUUUU| Because the committed transaction state = 01 = TRANSACTION_STATUS_COMMITTED. When 4 consecutive transactions in a byte are all committed, it becomes 01010101.\nBinary: 01010101, hex: 55 Hex 55 in ASCII is \u0026lsquo;U\u0026rsquo;, so when visually examining CLOG files you can generally see many U\u0026rsquo;s Occasionally some bytes are not 55 or U because in production environments some transactions occasionally haven\u0026rsquo;t completed or use subtransactions. The committed state of subtransactions in CLOG is 0x03. 
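The arithmetic from the sections above can be collected into a small Python sketch. It is an illustration only, not PostgreSQL code: it assumes the default 8k BLCKSZ, the function names (clog_location, decode_status, decode_byte) are made up for this example, and the xid 2349939 is a hypothetical value chosen so that it lands on the same segment, byte offset, and bit index as the SQL output shown earlier.

```python
# Sketch of the xid -> CLOG location arithmetic, assuming the default 8k BLCKSZ.
# Constant names mirror the macros quoted from the PostgreSQL source;
# the function names are hypothetical helpers for this example.
BLCKSZ = 8192
CLOG_BITS_PER_XACT = 2
CLOG_XACTS_PER_BYTE = 4
CLOG_XACTS_PER_PAGE = BLCKSZ * CLOG_XACTS_PER_BYTE  # 32768
SLRU_PAGES_PER_SEGMENT = 32
CLOG_XACTS_PER_SEGMENT = CLOG_XACTS_PER_PAGE * SLRU_PAGES_PER_SEGMENT  # 1048576
CLOG_XACT_BITMASK = (1 << CLOG_BITS_PER_XACT) - 1  # 3, binary 11

STATUS_NAMES = {0: 'IN_PROGRESS', 1: 'COMMITTED', 2: 'ABORTED', 3: 'SUB_COMMITTED'}


def clog_location(txid):
    '''Return (segment file name, byte offset in segment, bit index in byte).'''
    xid = txid % 2**32  # drop the epoch that txid_current() includes
    segment = '%04X' % (xid // CLOG_XACTS_PER_SEGMENT)
    byte_offset = (xid % CLOG_XACTS_PER_SEGMENT) // CLOG_XACTS_PER_BYTE
    bit_index = xid % CLOG_XACTS_PER_BYTE
    return segment, byte_offset, bit_index


def decode_status(byte_value, bit_index):
    '''Extract the 2-bit status of one transaction from a CLOG byte.'''
    bshift = bit_index * CLOG_BITS_PER_XACT
    return STATUS_NAMES[(byte_value >> bshift) & CLOG_XACT_BITMASK]


def decode_byte(byte_value):
    '''Statuses of the four transactions in a byte, in xid order.'''
    return [decode_status(byte_value, i) for i in range(CLOG_XACTS_PER_BYTE)]


if __name__ == '__main__':
    print(clog_location(2349939))   # ('0002', 63196, 3)
    print(decode_status(0x95, 3))   # ABORTED
    print(decode_byte(0x55))        # four COMMITTED entries
```

Feeding it a byte read with hexdump -s and -n 1 gives the status directly, so the manual binary conversion step can be skipped.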
Shared CLOG Buffer # The number of CLOG shared buffers is easy to understand:\n/* * Number of shared CLOG buffers. * * On larger multi-processor systems, it is possible to have many CLOG page * requests in flight at one time which could lead to disk access for CLOG * page if the required page is not found in memory. Testing revealed that we * can get the best performance by having 128 CLOG buffers, more than that it * doesn\u0026#39;t improve performance. * * Unconditionally keeping the number of CLOG buffers to 128 did not seem like * a good idea, because it would increase the minimum amount of shared memory * required to start, which could be a problem for people running very small * configurations. The following formula seems to represent a reasonable * compromise: people with very low values for shared_buffers will get fewer * CLOG buffers as well, and everyone else will get 128. */ Size CLOGShmemBuffers(void) { return Min(128, Max(4, NBuffers / 512)); } In plain terms: testing has shown that 128 CLOG buffers provide the best performance, and more than that doesn\u0026rsquo;t help. However, always allocating 128 CLOG buffers would be too much for very small configurations, so the count is 1/512 of the number of shared buffers (NBuffers). In other words: number of CLOG buffers = NBuffers / 512, minimum is 4, maximum is 128. Note: these are all buffer counts, not sizes!\nHow large is a single buffer? CLOG buffer is managed by SLRU, and each SLRU page is 8k:\nA page is the same BLCKSZ as is used everywhere\nWe can glimpse the size of shared CLOG buffer from the perspective of CLOG SLRU initialization:\n/* * Initialization of shared memory for CLOG */ Size CLOGShmemSize(void) { return SimpleLruShmemSize(CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE); } The passed CLOGShmemBuffers() is 4~128, and the passed CLOG_LSNS_PER_PAGE = 1024 with 8k pages (this is the number of LSN entries tracked per page, a count rather than a byte size). 
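As a rough illustration of the buffer-count formula quoted above, here is a minimal Python sketch. It assumes the default 8k BLCKSZ and that NBuffers is simply shared_buffers divided by the block size; the helper names are hypothetical and only mirror the C expression Min(128, Max(4, NBuffers / 512)).

```python
# Sketch of the CLOG buffer count formula, assuming 8k blocks.
# clog_shmem_buffers mirrors Min(128, Max(4, NBuffers / 512));
# nbuffers_for is a hypothetical helper converting a size in bytes to a page count.
BLCKSZ = 8192


def nbuffers_for(shared_buffers_bytes):
    '''Number of shared buffer pages for a given shared_buffers size.'''
    return shared_buffers_bytes // BLCKSZ


def clog_shmem_buffers(nbuffers):
    '''Number of CLOG buffers: 1/512 of NBuffers, clamped to [4, 128].'''
    return min(128, max(4, nbuffers // 512))


if __name__ == '__main__':
    for label, size in [('1MB', 1 << 20), ('128MB', 128 << 20), ('1GB', 1 << 30)]:
        n = clog_shmem_buffers(nbuffers_for(size))
        print(label, '->', n, 'CLOG buffers,', n * BLCKSZ // 1024, 'kB of pages')
```

Under these assumptions the upper bound of 128 is reached once shared_buffers is 512MB or larger, so most production-sized instances get the maximum.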
SimpleLruShmemSize computes the shared memory size an SLRU needs:\nSize SimpleLruShmemSize(int nslots, int nlsns) { Size\tsz; /* we assume nslots isn\u0026#39;t so large as to risk overflow */ sz = MAXALIGN(sizeof(SlruSharedData)); sz += MAXALIGN(nslots * sizeof(char *));\t/* page_buffer[] */ sz += MAXALIGN(nslots * sizeof(SlruPageStatus));\t/* page_status[] */ sz += MAXALIGN(nslots * sizeof(bool));\t/* page_dirty[] */ sz += MAXALIGN(nslots * sizeof(int));\t/* page_number[] */ sz += MAXALIGN(nslots * sizeof(int));\t/* page_lru_count[] */ sz += MAXALIGN(nslots * sizeof(LWLockPadded));\t/* buffer_locks[] */ if (nlsns \u0026gt; 0) sz += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr));\t/* group_lsn[] */ return BUFFERALIGN(sz) + BLCKSZ * nslots; } SLRU uses some arrays to store SLRU metadata and control information. Most of them are roughly data type size * buffer count and are small. The exception is group_lsn[]: with nlsns = 1024 and 8-byte XLogRecPtr values it takes nslots * 1024 * 8 bytes, i.e., 8k per slot, the same as the page itself. The page buffers take BLCKSZ * nslots, i.e., 8k * (4~128) = (32k~1M). So with the maximum of 128 buffers we can roughly estimate that the shared CLOG area is about 1M of pages plus about 1M of group_lsn, around 2M in total.\nCLOG WAL: Types, Writing, and Redo # When writing CLOG, is CLOG WAL log also written? If so, wouldn\u0026rsquo;t that mean lost CLOG could be restored by reapplying WAL logs to recover transaction states? Let\u0026rsquo;s explore the CLOG WAL writing and redo source code with these questions in mind.\nExtend CLOG # ZeroCLOGPage writes WAL. ZeroCLOGPage(pageno, true) is actually only called by ExtendCLOG:\n/* * Make sure that CLOG has room for a newly-allocated XID. * * NB: this is called while holding XidGenLock. We want it to be very fast * most of the time; even when it\u0026#39;s not so fast, no actual I/O need happen * unless we\u0026#39;re forced to write out a dirty clog or xlog page to make room * in shared memory. */ void ExtendCLOG(TransactionId newestXact) { int\tpageno; /* * No work except at first XID of a page. But beware: just after * wraparound, the first XID of page zero is FirstNormalTransactionId. 
*/ if (TransactionIdToPgIndex(newestXact) != 0 \u0026amp;\u0026amp; !TransactionIdEquals(newestXact, FirstNormalTransactionId)) return; pageno = TransactionIdToPage(newestXact); // CLOG page number converted from TransactionId LWLockAcquire(XactSLRULock, LW_EXCLUSIVE); /* Zero the page and make an XLOG entry about it */ ZeroCLOGPage(pageno, true); LWLockRelease(XactSLRULock); } ZeroCLOGPage mainly calls WriteZeroPageXlogRec:\n/* * Write a ZEROPAGE xlog record */ static void WriteZeroPageXlogRec(int pageno) { XLogBeginInsert(); XLogRegisterData((char *) (\u0026amp;pageno), sizeof(int)); (void) XLogInsert(RM_CLOG_ID, CLOG_ZEROPAGE); } WriteZeroPageXlogRec is writing a WAL record, with type \u0026ldquo;RM_CLOG_ID, CLOG_ZEROPAGE\u0026rdquo;. Using waldump, you can view CLOG_ZEROPAGE. Its proportion is generally very small:\npg_waldump -z 000000010000056B00000018 --stat=record Type N (%) Record size (%) FPI size (%) Combined size (%) ---- - --- ----------- --- -------- --- ------------- --- ... CLOG/ZEROPAGE 1 ( 0.00) 30 ( 0.00) 0 ( 0.00) 30 ( 0.00) ... Extending CLOG page is always in page units. In fact, at the end of a CLOG segment you can easily see 00s:\nhexdump 03C2 0000000 5555 5555 5555 5555 5555 5555 5555 5555 * 001bb30 5555 5555 0055 0000 0000 0000 0000 0000 001bb40 0000 0000 0000 0000 0000 0000 0000 0000 * ## The end of the CLOG file is all zeros 001c000 Truncate CLOG # Besides extending CLOG, there\u0026rsquo;s also truncating CLOG. Truncate CLOG is called during vacuum. When called, it writes a truncate CLOG WAL record and flushes the WAL record to disk:\n/* * Remove all CLOG segments before the one holding the passed transaction ID * * Before removing any CLOG data, we must flush XLOG to disk, to ensure * that any recently-emitted FREEZE_PAGE records have reached disk; otherwise * a crash and restart might leave us with some unfrozen tuples referencing * removed CLOG data. We choose to emit a special TRUNCATE XLOG record too. 
* Replaying the deletion from XLOG is not critical, since the files could * just as well be removed later, but doing so prevents a long-running hot * standby server from acquiring an unreasonably bloated CLOG directory. * * Since CLOG segments hold a large number of transactions, the opportunity to * actually remove a segment is fairly rare, and so it seems best not to do * the XLOG flush unless we have confirmed that there is a removable segment. */ void TruncateCLOG(TransactionId oldestXact, Oid oldestxid_datoid) { int\tcutoffPage; /* * The cutoff point is the start of the segment containing oldestXact. We * pass the *page* containing oldestXact to SimpleLruTruncate. */ // What\u0026#39;s written to WAL is the CLOG position, which is the CLOG page number converted from oldestXact cutoffPage = TransactionIdToPage(oldestXact); ..... /* * Write XLOG record and flush XLOG to disk. We record the oldest xid * we\u0026#39;re keeping information about here so we can ensure that it\u0026#39;s always * ahead of clog truncation in case we crash, and so a standby finds out * the new valid xid before the next checkpoint. */ // WriteTruncateXlogRec writes the corresponding WAL record and flushes it to disk WriteTruncateXlogRec(cutoffPage, oldestXact, oldestxid_datoid); // After WAL is written, actually execute the CLOG segment truncation /* Now we can remove the old CLOG segment(s) */ SimpleLruTruncate(XactCtl, cutoffPage); } WriteTruncateXlogRec writes a WAL record with RMGR as RM_CLOG_ID and info as CLOG_TRUNCATE:\n/* * Write a TRUNCATE xlog record * * We must flush the xlog record to disk before returning --- see notes * in TruncateCLOG(). 
*/ static void WriteTruncateXlogRec(int pageno, TransactionId oldestXact, Oid oldestXactDb) { XLogRecPtr\trecptr; xl_clog_truncate xlrec; xlrec.pageno = pageno; xlrec.oldestXact = oldestXact; xlrec.oldestXactDb = oldestXactDb; XLogBeginInsert(); XLogRegisterData((char *) (\u0026amp;xlrec), sizeof(xl_clog_truncate)); recptr = XLogInsert(RM_CLOG_ID, CLOG_TRUNCATE); XLogFlush(recptr); } After generating CLOG WAL records, the redo recovery routine is also needed:\n/* * CLOG resource manager\u0026#39;s routines */ void clog_redo(XLogReaderState *record) { ... // When redo info type is CLOG_ZEROPAGE, place the read redo information in memory, then write to the CLOG page file if (info == CLOG_ZEROPAGE) { int\tpageno; int\tslotno; memcpy(\u0026amp;pageno, XLogRecGetData(record), sizeof(int)); LWLockAcquire(XactSLRULock, LW_EXCLUSIVE); slotno = ZeroCLOGPage(pageno, false); SimpleLruWritePage(XactCtl, slotno); Assert(!XactCtl-\u0026gt;shared-\u0026gt;page_dirty[slotno]); LWLockRelease(XactSLRULock); } // When redo info type is CLOG_TRUNCATE, place the read redo information in memory, confirm the page is deletable (write page if not), then truncate the segment else if (info == CLOG_TRUNCATE) { xl_clog_truncate xlrec; memcpy(\u0026amp;xlrec, XLogRecGetData(record), sizeof(xl_clog_truncate)); /* * During XLOG replay, latest_page_number isn\u0026#39;t set up yet; insert a * suitable value to bypass the sanity test in SimpleLruTruncate. 
*/ XactCtl-\u0026gt;shared-\u0026gt;latest_page_number = xlrec.pageno; AdvanceOldestClogXid(xlrec.oldestXact); SimpleLruTruncate(XactCtl, xlrec.pageno); } else elog(PANIC, \u0026#34;clog_redo: unknown op code %u\u0026#34;, info); } What the CLOG redo routine does:\nWhen redo info type is CLOG_ZEROPAGE: finds a suitable slot (evict if necessary), performs writability checks based on the read redo information (actually the CLOG page number), then writes the page to the CLOG file When redo info type is CLOG_TRUNCATE: based on the read redo information (actually the CLOG page number), confirms the page is deletable (write page if not available), then truncates the CLOG segment CLOG Synchronization Summary # CLOG has only two types of WAL logs, neither containing transaction status information. They are only triggered when extending CLOG pages and truncating CLOG segments, and the written WAL record is just a CLOG page number. CLOG\u0026rsquo;s WAL log RMGR type has only one: RM_CLOG_ID. This type has only two info codes: CLOG_ZEROPAGE, CLOG_TRUNCATE.\n/* XLOG stuff */ #define CLOG_ZEROPAGE 0x00 #define CLOG_TRUNCATE 0x10 CLOG WAL synchronization summary: The standby database is essentially not synchronizing CLOG information — it\u0026rsquo;s only synchronizing some CLOG file expansion and deletion information.\nHowever, the standby\u0026rsquo;s CLOG file clearly does have status information, and the standby obviously needs this information for visibility checking. How is the transaction status in CLOG synchronized?\nTransaction ID WAL: Types, Writing, and Redo # The WAL for rmgr=CLOG doesn\u0026rsquo;t contain transaction status. Does the standby not synchronize CLOG transaction information? 
No — WAL logs do contain transaction ID status information, and CLOG is also updated:\n-- Roll back a transaction, commit a transaction \u0026gt; begin; BEGIN \u0026gt; select txid_current(); txid_current -------------- 1817254 (1 row) \u0026gt; rollback; ROLLBACK \u0026gt; begin; BEGIN \u0026gt; select txid_current(); txid_current -------------- 1817258 (1 row) \u0026gt; commit; COMMIT \u0026gt; checkpoint; CHECKPOINT -- pg_waldump to view transaction ID status in logs [datalzl/pg_wal]$ pg_waldump ../../pg_wal/000000010000007300000008|grep -E \u0026#34;1817254|1817258\u0026#34; rmgr: Transaction len (rec/tot): 34/ 34, tx: 1817254, lsn: 73/400ED210, prev 73/400ED1E0, desc: ABORT 2024-08-01 14:41:26.017612 CST rmgr: Transaction len (rec/tot): 46/ 46, tx: 1817258, lsn: 73/400EEB08, prev 73/400EEAD8, desc: COMMIT 2024-08-01 14:41:37.042545 CST pg_waldump: fatal: error in WAL record at 73/400F7C78: invalid record length at 73/400F7F88: wanted 24, got 0 The WAL records the status of transaction IDs (1817254, 1817258), recorded as ABORT and COMMIT respectively; rmgr is Transaction. Transaction ID status is in WAL logs, but does PostgreSQL write it to the standby\u0026rsquo;s CLOG? Obviously, we need to find this redo information. Based on previous experience, clog_redo represents the WAL redo source code for rmgr=CLOG. Searching the source for _redo should find the WAL redo source code for rmgr=Transaction. Searching\u0026hellip; in xact.c we find the function xact_redo, which mainly calls xact_redo_commit and xact_redo_abort, clearly corresponding to WAL log application logic for committed and rolled-back transactions respectively.\nvoid xact_redo(XLogReaderState *record) { uint8\tinfo = XLogRecGetInfo(record) \u0026amp; XLOG_XACT_OPMASK; /* Backup blocks are not used in xact records */ Assert(!XLogRecHasAnyBlockRefs(record)); if (info == XLOG_XACT_COMMIT) { ... 
xact_redo_commit(\u0026amp;parsed, XLogRecGetXid(record), record-\u0026gt;EndRecPtr, XLogRecGetOrigin(record)); } ... else if (info == XLOG_XACT_ABORT) { ... xact_redo_abort(\u0026amp;parsed, XLogRecGetXid(record)); } ... } else elog(PANIC, \u0026#34;xact_redo: unknown op code %u\u0026#34;, info); } Taking commit as an example:\nstatic void xact_redo_commit(xl_xact_parsed_commit *parsed, TransactionId xid, XLogRecPtr lsn, RepOriginId origin_id) { ... if (standbyState == STANDBY_DISABLED) { /* * Mark the transaction committed in pg_xact. */ TransactionIdCommitTree(xid, parsed-\u0026gt;nsubxacts, parsed-\u0026gt;subxacts); } else // standby logic { ... /* * Mark the transaction committed in pg_xact. We use async commit * protocol during recovery to provide information on database * consistency for when users try to set hint bits. It is important * that we do not set hint bits until the minRecoveryPoint is past * this commit record. This ensures that if we crash we don\u0026#39;t see hint * bits set on changes made by transactions that haven\u0026#39;t yet * recovered. It\u0026#39;s unlikely but it\u0026#39;s good to be safe. */ // Mark transaction committed in pg_xact TransactionIdAsyncCommitTree(xid, parsed-\u0026gt;nsubxacts, parsed-\u0026gt;subxacts, lsn); ... 
} It looks like TransactionIdAsyncCommitTree is the function we\u0026rsquo;re looking for that writes to CLOG.\nTo verify the redo logic for transaction commit information in WAL, let\u0026rsquo;s set three breakpoints on the standby\u0026rsquo;s startup process, then execute begin;select txid_current();commit; on the source database to commit a transaction, and see if the standby\u0026rsquo;s startup process hits the functions we want to see when doing redo:\n(gdb) bt #0 TransactionIdAsyncCommitTree (xid=xid@entry=1818665, nxids=0, xids=0x0, lsn=lsn@entry=495398394064) at transam.c:274 #1 0x000000000050c139 in xact_redo_commit (parsed=parsed@entry=0x7ffda52c0fc0, xid=1818665, lsn=495398394064, origin_id=\u0026lt;optimized out\u0026gt;) at xact.c:5805 #2 0x000000000050ffa3 in xact_redo (record=0x2b5ff2434038) at xact.c:5962 #3 0x0000000000519ea5 in StartupXLOG () at xlog.c:7411 #4 0x000000000072f301 in StartupProcessMain () at startup.c:204 #5 0x0000000000528701 in AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7ffda52c6ef0) at bootstrap.c:450 #6 0x000000000072c459 in StartChildProcess (type=StartupProcess) at postmaster.c:5494 #7 0x000000000072ec44 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x2b5ff242d1c0) at postmaster.c:1407 #8 0x000000000048931f in main (argc=3, argv=0x2b5ff242d1c0) at main.c:210 (gdb) info b Num Type Disp Enb Address What 1 breakpoint keep y 0x000000000050c060 in xact_redo_commit at xact.c:5753 breakpoint already hit 43 times 2 breakpoint keep y 0x0000000000508190 in TransactionIdCommitTree at transam.c:262 3 breakpoint keep y 0x00000000005081a0 in TransactionIdAsyncCommitTree at transam.c:274 breakpoint already hit 1 time The breakpoint TransactionIdAsyncCommitTree is hit, and xid=1818665, which is the transaction ID just committed on the source database. This confirms the code logic we visually traced is correct. 
So, the standby database\u0026rsquo;s CLOG transaction ID status is synchronized by WAL with rmgr=Transaction.\nSummary # CLOG only stores transaction ID status, not the transaction ID itself Transaction status in CLOG files can be manually located via the transaction ID WAL for rmgr=CLOG only extends and cleans up CLOG files, it does not update transaction status WAL for rmgr=Transaction updates CLOG transaction status References # \u0026ldquo;Quickly Mastering PostgreSQL Version New Features\u0026rdquo;, p24\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nYan Shuli, PostgreSQL CLOG Analysis https://www.modb.pro/db/606433\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n\u0026ldquo;PostgreSQL Database Kernel Analysis\u0026rdquo;, Chapter 7, p380-390\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","date":"Sep 3, 2024","externalUrl":null,"permalink":"/en/2024/09/03/postgresql-clog-files-and-standby-synchronization-analysis/","section":"Posts","summary":"Among all relational databases, PostgreSQL’s CLOG is a very special type of log. CLOG’s existence is inseparable from PostgreSQL’s MVCC mechanism. Some basic knowledge about transaction IDs and CLOG won’t be covered in this article. If interested, please refer to CLOG and Hint Bits. This article focuses on the structure of CLOG files, manually locating transaction states, and the CLOG WAL log synchronization mechanism, to further understand PostgreSQL’s CLOG.\nCLOG Segment # CLOG Directory # To distinguish from regular logs, PostgreSQL 10 renamed the CLOG and WAL directories 1:\n","title":"PostgreSQL CLOG Files and Standby Synchronization Analysis","type":"posts"},{"content":" Problem Analysis Overview # The database kept OOMing. Analysis revealed the issue was in query plan generation: planning time ~1 second, planning shared hits ~1 million. After thorough investigation, the root cause was identified as bloat in the statistics base table pg_statistic. 
On the first SQL execution of a session — due to a CatCacheMiss — the backend accessed and cached an excessive amount of dead tuple data from pg_statistic. Application connections always spawned new sessions, and the combined memory usage across multiple backends was too large, leading to OOM.\nBelow is the detailed analysis process.\nProblem Symptoms # A certain database kept OOMing and restarting. After investigation, we found that while the number of concurrent sessions wasn\u0026rsquo;t high, each session\u0026rsquo;s memory footprint was quite large. The total memory exceeded the cgroup memory limit, causing OOM.\nWe could preliminarily rule out the following causes:\nNot caused by excessive metadata. Too many objects (typically too many partitions) would cause sessions to cache excessive metadata. This database didn\u0026rsquo;t have that many objects. Not caused by SQL execution plan issues. Sorting/hash operations might use too much memory. This database didn\u0026rsquo;t fit that scenario — the SQL in question was a simple sequential scan. 
During the investigation, we discovered that any simple SQL query in this database took a very long time to execute, and Planning Buffers showed about 1 million hits:\nexplain (analyze,buffers,timing) select * from lzlinfo limit 1; QUERY PLAN -------------------------------------------------------------------------------------------------------------------- Limit (cost=0.00..1.02 rows=1 width=71) (actual time=0.011..0.012 rows=1 loops=1) Buffers: shared hit=1 -\u0026gt; Seq Scan on lzlinfo (cost=0.00..480.73 rows=473 width=71) (actual time=0.010..0.010 rows=1 loops=1) Buffers: shared hit=1 Planning: Buffers: shared hit=1127312 -- Abnormal planning shared hit Planning Time: 947.038 ms -- Abnormal planning time Execution Time: 0.035 ms (8 rows) Running the same SQL a second time, the planning time was normal.\nProblem Investigation Process # Printing Execution Plan Statistics # We enabled logging for each phase of the execution plan:\nset log_parser_stats =on; set log_planner_stats =on; set log_executor_stats =on; Then ran the SQL. The log output was as follows:\n2024-08-13 10:02:33.936 CST,\u0026#34;postgres\u0026#34;,\u0026#34;lzldb\u0026#34;,85532,\u0026#34;[local]\u0026#34;,66babe8c.14e1c,13,\u0026#34;idle\u0026#34;,2024-08-13 10:01:48 CST,4/713,0,LOG,00000,\u0026#34;PARSER STATISTICS\u0026#34;,\u0026#34;! system usage stats: ! 0.000046 s user, 0.000046 s system, 0.000091 s elapsed ! [0.001661 s user, 0.001661 s system total] ! 4660 kB max resident size ! 0/0 [0/8] filesystem blocks in/out ! 0/36 [0/996] page faults/reclaims, 0 [0] swaps ! 0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent ! 
0/0 [5/0] voluntary/involuntary context switches\u0026#34;,,,,,\u0026#34;explain (analyze,buffers) select *,1 from lzlinfo 2024-08-13 10:02:33.938 CST,\u0026#34;postgres\u0026#34;,\u0026#34;lzldb\u0026#34;,85532,\u0026#34;[local]\u0026#34;,66babe8c.14e1c,14,\u0026#34;EXPLAIN\u0026#34;,2024-08-13 10:01:48 CST,4/713,0,LOG,00000,\u0026#34;PARSE ANALYSIS STATISTICS\u0026#34;,\u0026#34;! system usage stats: ! 0.001459 s user, 0.000000 s system, 0.001464 s elapsed ! [0.003146 s user, 0.001687 s system total] ! 5972 kB max resident size ! 0/0 [0/8] filesystem blocks in/out ! 0/325 [0/1324] page faults/reclaims, 0 [0] swaps ! 0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent ! 0/0 [5/0] voluntary/involuntary context switches\u0026#34;,,,,,\u0026#34;explain (analyze,buffers) select *,1 from lzlinfo 2024-08-13 10:02:33.938 CST,\u0026#34;postgres\u0026#34;,\u0026#34;lzldb\u0026#34;,85532,\u0026#34;[local]\u0026#34;,66babe8c.14e1c,15,\u0026#34;EXPLAIN\u0026#34;,2024-08-13 10:01:48 CST,4/713,0,LOG,00000,\u0026#34;REWRITER STATISTICS\u0026#34;,\u0026#34;! system usage stats: ! 0.000001 s user, 0.000000 s system, 0.000001 s elapsed ! [0.003177 s user, 0.001687 s system total] ! 5972 kB max resident size ! 0/0 [0/8] filesystem blocks in/out ! 0/0 [0/1324] page faults/reclaims, 0 [0] swaps ! 0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent ! 0/0 [5/0] voluntary/involuntary context switches\u0026#34;,,,,,\u0026#34;explain (analyze,buffers) select *,1 from lzlinfo 2024-08-13 10:02:34.644 CST,\u0026#34;postgres\u0026#34;,\u0026#34;lzldb\u0026#34;,85532,\u0026#34;[local]\u0026#34;,66babe8c.14e1c,16,\u0026#34;EXPLAIN\u0026#34;,2024-08-13 10:01:48 CST,4/713,0,LOG,00000,\u0026#34;PLANNER STATISTICS\u0026#34;,\u0026#34;! system usage stats: ! 0.539964 s user, 0.164083 s system, 0.705718 s elapsed ! [0.543248 s user, 0.165770 s system total] ! 745072 kB max resident size -- Abnormal point ! 0/0 [0/8] filesystem blocks in/out ! 0/184803 [0/186157] page faults/reclaims, 0 [0] swaps ! 
0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent ! 0/1 [5/1] voluntary/involuntary context switches\u0026#34;,,,,,\u0026#34;explain (analyze,buffers) select *,1 from lzlinfo 2024-08-13 10:02:34.644 CST,\u0026#34;postgres\u0026#34;,\u0026#34;lzldb\u0026#34;,85532,\u0026#34;[local]\u0026#34;,66babe8c.14e1c,17,\u0026#34;EXPLAIN\u0026#34;,2024-08-13 10:01:48 CST,4/713,0,LOG,00000,\u0026#34;EXECUTOR STATISTICS\u0026#34;,\u0026#34;! system usage stats: ! 0.540248 s user, 0.164170 s system, 0.706088 s elapsed ! [0.543532 s user, 0.165857 s system total] ! 745596 kB max resident size ! 0/0 [0/8] filesystem blocks in/out ! 0/184898 [0/186252] page faults/reclaims, 0 [0] swaps ! 0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent ! 0/1 [5/1] voluntary/involuntary context switches\u0026#34;,,,,,\u0026#34;explain (analyze,buffers) select *,1 from lzlinfo \u0026#34; During the planner phase, memory usage skyrocketed and elapsed time also spiked. This pinpointed the issue to the planner phase within the overall planning stage. 
There wasn\u0026rsquo;t much else actionable from the stats.\nstrace Tracing # strace -p 76419 strace: Process 76419 attached epoll_wait(4, [{EPOLLIN, {u32=15422552, u64=15422552}}], 1, -1) = 1 recvfrom(9, \u0026#34;Q\\0\\0\\0\\262explain (analyze,buffers) s\u0026#34;..., 8192, 0, NULL, NULL) = 179 lseek(5, 0, SEEK_END) = 8192 brk(NULL) = 0xfed000 brk(0x100e000) = 0x100e000 brk(NULL) = 0x100e000 brk(NULL) = 0x100e000 brk(0x1007000) = 0x1007000 brk(NULL) = 0x1007000 mmap(NULL, 270336, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b7806b0c000 open(\u0026#34;base/17076/16678\u0026#34;, O_RDWR) = 7 lseek(7, 0, SEEK_END) = 0 open(\u0026#34;base/17076/46160\u0026#34;, O_RDWR) = 12 lseek(12, 0, SEEK_END) = 7667712 open(\u0026#34;base/17076/46168\u0026#34;, O_RDWR) = 13 lseek(13, 0, SEEK_END) = 188416 open(\u0026#34;base/17076/46170\u0026#34;, O_RDWR) = 14 lseek(14, 0, SEEK_END) = 188416 mmap(NULL, 528384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b78c1b36000 brk(NULL) = 0x1007000 brk(0x102c000) = 0x102c000 brk(NULL) = 0x102c000 brk(NULL) = 0x102c000 brk(0x1025000) = 0x1025000 brk(NULL) = 0x1025000 lseek(12, 0, SEEK_END) = 7667712 open(\u0026#34;pg_stat_tmp/pgss_query_texts.stat\u0026#34;, O_RDWR|O_CREAT, 0600) = 15 pwrite64(15, \u0026#34;explain (analyze,buffers) select\u0026#34;..., 172, 93934) = 172 pwrite64(15, \u0026#34;\\0\u0026#34;, 1, 94106) = 1 close(15) = 0 sendto(8, \u0026#34;\\2\\0\\0\\0\\250\\3\\0\\0\\264B\\0\\0\\10\\0\\0\\0\\1\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\u0026#34;..., 936, 0, NULL, 0) = 936 sendto(8, \u0026#34;\\2\\0\\0\\0\\250\\3\\0\\0\\264B\\0\\0\\10\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\u0026#34;..., 936, 0, NULL, 0) = 936 sendto(8, \u0026#34;\\2\\0\\0\\0\\250\\3\\0\\0\\264B\\0\\0\\10\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\u0026#34;..., 936, 0, NULL, 0) = 936 sendto(8, 
\u0026#34;\\2\\0\\0\\0\\250\\3\\0\\0\\264B\\0\\0\\10\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\u0026#34;..., 936, 0, NULL, 0) = 936 sendto(8, \u0026#34;\\2\\0\\0\\0\\250\\3\\0\\0\\264B\\0\\0\\10\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\u0026#34;..., 936, 0, NULL, 0) = 936 sendto(8, \u0026#34;\\2\\0\\0\\0\\10\\1\\0\\0\\264B\\0\\0\\2\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\u0026#34;..., 264, 0, NULL, 0) = 264 sendto(8, \u0026#34;\\2\\0\\0\\0\\10\\1\\0\\0\\0\\0\\0\\0\\2\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\u0026#34;..., 264, 0, NULL, 0) = 264 sendto(8, \u0026#34;\\16\\0\\0\\0H\\0\\0\\0\\6\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\1\\0\\0\\0\\0\\0\\0\\0\u0026#34;..., 72, 0, NULL, 0) = 72 sendto(9, \u0026#34;T\\0\\0\\0#\\0\\1QUERY PLAN\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\31\\377\\377\\377\\377\u0026#34;..., 826, 0, NULL, 0) = 826 recvfrom(9, 0xd2b4e0, 8192, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable) epoll_wait(4, Although there were many shared hits, strace didn\u0026rsquo;t reveal much. strace showed the session only opened 4 data files. 
Using fd and oid2name to look up the data files, they turned out to be: the table, two indexes on the table, and pathman_config:\nFrom database \u0026#34;lzldb\u0026#34;: Filenode Table Name -------------------------------------- 46170 ix_name 46168 pk_lzlinfo 46160 lzlinfo 16678 pathman_config These objects are not large, so it didn\u0026rsquo;t look like oversized tables (or indexes) were the cause.\nperf # (No screenshot — use your imagination.)\nThe perf flame graph showed ~40% of the time spent on the heap_hot_search_buffer stack.\ngdb # Using heap_hot_search_buffer as a clue, after multiple gdb sessions, we set the following breakpoints to investigate:\nb relation_open b get_relation_info b RelationCacheInvalidateEntry b get_relname_relid b AcceptInvalidationMessages b RelationClearRelation b pg_hint_plan_planner b heap_hot_search_buffer When breakpoints first hit, there was a lot of noise — they were normal logic. But later, after execution reached a certain point, only heap_hot_search_buffer kept hitting:\nBreakpoint 15, heap_hot_search_buffer (tid=tid@entry=0x2313c60, relation=0x2b2141663910, buffer=17045, snapshot=snapshot@entry=0x228a058, heapTuple=heapTuple@entry=0x23273d0, all_dead=all_dead@entry=0x7ffce272e28f, first_call=true) at heapam.c:1503 1503 in heapam.c (gdb) Continuing. ... Breakpoint 15, heap_hot_search_buffer (tid=tid@entry=0x2313c60, relation=0x2b2141663910, buffer=96708, snapshot=snapshot@entry=0x228a058, heapTuple=heapTuple@entry=0x23273d0, all_dead=all_dead@entry=0x7ffce272e28f, first_call=true) at heapam.c:1503 1503 in heapam.c Most arguments passed to heap_hot_search_buffer remained unchanged — including the addresses of relation and heapTuple — only the buffer parameter changed, indicating it was scanning the same relation.\nheapTuple contained table OID information. 
Let\u0026rsquo;s print it:\n(gdb) p *heapTuple $46 = { t_len = 968, t_self = { ip_blkid = { bi_hi = 0, bi_lo = 7211 }, ip_posid = 5 }, t_tableOid = 2619, -- This is useful t_data = 0x2b2155fced00 heap_hot_search_buffer was called with OID=2619. Looking up 2619 in pg_class, it\u0026rsquo;s pg_statistic:\nselect oid,relname from pg_class where oid in (2619) oid | relname -------+---------------------------------- 2619 | pg_statistic Accessing the statistics base table is expected — PG needs statistics to estimate costs when generating candidate execution plans.\npg_statistic Bloat # Now that we\u0026rsquo;ve pinpointed pg_statistic, let\u0026rsquo;s check its condition:\n\u0026gt; \\dt+ pg_statistic List of relations Schema | Name | Type | Owner | Persistence | Size | Description ------------+--------------+-------+----------+-------------+---------+------------- pg_catalog | pg_statistic | table | postgres | permanent | 1036 MB | \u0026gt; select * from pg_class where relname=\u0026#39;pg_statistic\u0026#39;\\gx -[ RECORD 1 ]-------+------------------------------------------------ oid | 2619 relname | pg_statistic relnamespace | 11 reltype | 12016 reloftype | 0 relowner | 10 relam | 2 relfilenode | 2619 reltablespace | 0 relpages | 132481 reltuples | 4655 pg_statistic is 1GB — certainly oversized. 132,481 blocks but only 4,655 rows — this is clearly table bloat. But even with bloat, does accessing statistics really require caching the entire pg_statistic table? Logically, no — you only need the statistics for the specific table. And indeed, PG accesses pg_statistic through its primary key index pg_statistic_relid_att_inh_index. From the call stack below, we can see the composite primary key fields being passed:\nbt ... 
#6 0x000000000086edbc in SearchCatCacheMiss (cache=cache@entry=0x226ba80, nkeys=nkeys@entry=3, hashValue=hashValue@entry=853716409, hashIndex=hashIndex@entry=57, v1=v1@entry=18767, v2=v2@entry=1, v3=v3@entry=0, v4=v4@entry=0) at catcache.c:1368 #7 0x000000000086fa82 in SearchCatCacheInternal (v4=0, v3=\u0026lt;optimized out\u0026gt;, v2=\u0026lt;optimized out\u0026gt;, v1=\u0026lt;optimized out\u0026gt;, nkeys=3, cache=0x226ba80) at catcache.c:1299 #8 SearchCatCache3 (cache=0x226ba80, v1=v1@entry=18767, v2=v2@entry=1, v3=v3@entry=0) at catcache.c:1183 #9 0x0000000000880d70 in SearchSysCache3 (cacheId=cacheId@entry=58, key1=key1@entry=18767, key2=key2@entry=1, key3=key3@entry=0) at syscache.c:1145 #10 0x0000000000874092 in get_attavgwidth (relid=relid@entry=18767, attnum=1) at lsyscache.c:2991 #11 0x00000000006a2d46 in set_rel_width (root=root@entry=0x2326600, rel=rel@entry=0x21e8418) at costsize.c:5516 ... The call passes relid=relid@entry=18767, attnum=1:\nselect ctid,starelid,staattnum from pg_statistic where starelid=18767; ctid | starelid | staattnum ------------+----------+----------- (132657,6) | 18767 | 1 (132657,7) | 18767 | 2 (132657,8) | 18767 | 3 (132657,9) | 18767 | 4 (132658,1) | 18767 | 5 (132658,2) | 18767 | 6 (132658,3) | 18767 | 7 (132658,4) | 18767 | 8 (132658,5) | 18767 | 9 (132658,6) | 18767 | 10 -- lzlinfo has 10 columns total, each with a staattnum entry From the ctid, we can see this data actually lives in just 2 blocks.\nNow let\u0026rsquo;s access pg_statistic via the composite primary key index. 
Even with data in only 2 blocks, it took 1 second to access with ~1 million (1,141,568) shared hits:\nexplain (analyze,buffers,timing,verbose) select ctid,starelid from pg_statistic where starelid=18767; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------------- Index Scan using pg_statistic_relid_att_inh_index on pg_catalog.pg_statistic (cost=0.41..103.31 rows=23 width=10) (actual time=105.416..1035.723 rows=10 loops=1) Output: ctid, starelid Index Cond: (pg_statistic.starelid = \u0026#39;18767\u0026#39;::oid) Buffers: shared hit=1141568 -- Abnormal Planning: Buffers: shared hit=8 Planning Time: 0.102 ms Execution Time: 1035.802 ms Accessing 10 rows in pg_statistic via the index resulted in ~1M shared hits — roughly matching the ~1M planning shared hits from the original SQL. (Note: Planning Time here is minimal, meaning the issue is not in plan generation per se, but in the data access during planning.)\nIndex Dead Tuples # If vacuum hasn\u0026rsquo;t truly \u0026ldquo;run properly\u0026rdquo;, index dead tuples still point to dead heap tuples.\nRefer to: From Very Slow Unique Index Scans to Index Bloat\nautovacuum Not Reclaiming Dead Tuples # With such severe table bloat, shouldn\u0026rsquo;t autovacuum have reclaimed it?\nselect * from pg_stat_all_tables where relname=\u0026#39;pg_statistic\u0026#39;\\gx -[ RECORD 1 ]-------+------------------------------ relid | 2619 schemaname | pg_catalog relname | pg_statistic seq_scan | 1 -- Very few sequential scans on pg_statistic seq_tup_read | 4655 idx_scan | 28715508 -- Many index scans on pg_statistic idx_tup_fetch | 25150245 n_tup_ins | 46 n_tup_upd | 1292143 -- Lots of updates n_tup_del | 14 n_tup_hot_upd | 138448 n_live_tup | 4655 n_dead_tup | 1496776 n_mod_since_analyze | 1292203 n_ins_since_vacuum | 0 last_vacuum | [null] last_autovacuum | 2024-08-16 20:34:15.045022+08 -- Note: autovacuum 
timestamp is recent last_analyze | [null] last_autoanalyze | [null] vacuum_count | 0 autovacuum_count | 144170 analyze_count | 0 autoanalyze_count | 0 Actually, autovacuum was constantly running on pg_statistic, but the worker process may not have been visible because it finished quickly (having nothing to actually reclaim) and went back to naptime:\nshow autovacuum_naptime ; autovacuum_naptime -------------------- 1min It naps every 1 minute, and the logs show autovacuum info printed every 1 minute as well:\n2024-08-16 21:05:15.267 CST,,,41080,,66bf4e87.a078,1,,2024-08-16 21:05:11 CST,27/166839,0,LOG,00000,\u0026#34;automatic vacuum of table \u0026#34;\u0026#34;lzldb.pg_catalog.pg_statistic\u0026#34;\u0026#34;: index scans: 0 pages: 0 removed, 132685 remain, 1 skipped due to pins, 0 skipped frozen tuples: 0 removed, 1501745 remain, 1497090 are dead but not yet removable, oldest xmin: 119329380 buffer usage: 265443 hits, 0 misses, 0 dirtied avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s system usage: CPU: user: 0.53 s, system: 0.17 s, elapsed: 3.38 s WAL usage: 1 records, 0 full page images, 233 bytes\u0026#34;,,,,,,,,,\u0026#34;\u0026#34;,\u0026#34;autovacuum worker\u0026#34; 2024-08-16 21:05:17.474 CST,,,41080,,66bf4e87.a078,2,,2024-08-16 21:05:11 CST,27/166844,136438968,LOG,00000,\u0026#34;automatic analyze of table \u0026#34;\u0026#34;lzldb.public.lzlinfo\u0026#34;\u0026#34; system usage: CPU: user: 2.02 s, system: 0.00 s, elapsed: 2.08 s\u0026#34;,,,,,,,,,\u0026#34;\u0026#34;,\u0026#34;autovacuum worker\u0026#34; \u0026#34; 1497090 are dead but not yet removable — although autovacuum was triggered, it didn\u0026rsquo;t reclaim any dead tuples at all. 
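When vacuum reports tuples as dead but not yet removable, something is holding the xmin horizon back. The usual suspects are long-running transactions, replication slots, standbys with feedback enabled, and prepared transactions. A hedged sketch of where to look, using only standard system views (not necessarily the exact queries from this investigation):

```sql
-- Sessions whose snapshot holds back the horizon, oldest first.
SELECT pid, state, xact_start, backend_xmin
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC;

-- Replication slots: compare xmin and catalog_xmin with the oldest xmin in the vacuum log.
SELECT slot_name, slot_type, active, xmin, catalog_xmin
FROM pg_replication_slots;

-- Standbys reporting an xmin through hot_standby_feedback.
SELECT application_name, state, backend_xmin
FROM pg_stat_replication;

-- Prepared (two-phase) transactions that were never committed or rolled back.
SELECT gid, prepared, transaction
FROM pg_prepared_xacts;
```

Whichever row carries a value equal to the oldest xmin printed by autovacuum is the one blocking cleanup.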
1,497,090 dead tuples remained uncleaned.\nInvestigating who held oldest xmin: 119329380, we quickly identified a replication slot:\nselect * from pg_replication_slots; slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size -----------------+----------+-----------+--------+----------+-----------+--------+------------+--------+--------------+--------------+---------------------+------------+--------------- slotslotlostname | pgoutput | logical | 17076 | lzldb | f | f | [null] | [null] | 119329380 | 3F9/105A4970 | 3F9/105F8778 | extended | [null] The slot\u0026rsquo;s catalog_xmin=119329380 matched the vacuum\u0026rsquo;s oldest xmin: 119329380.\nactive=f indicated that the replication link was already broken.\nFixing the Problem # Drop the replication slot:\nselect pg_drop_replication_slot(\u0026#39;slotslotlostname\u0026#39;); pg_drop_replication_slot -------------------------- Then manually vacuum or wait 1 minute for autovacuum.\nFinally, open a brand-new session to verify the fix:\n## psql psql (13.2) Type \u0026#34;help\u0026#34; for help. \u0026gt; \\c lzldb You are now connected to database \u0026#34;lzldb\u0026#34; as user \u0026#34;postgres\u0026#34;. \u0026gt; explain (analyze,buffers,timing) select * from lzlinfo limit 1; QUERY PLAN --------------------------------------------------------------------------------------------------------------------- Limit (cost=0.00..8.04 rows=1 width=71) (actual time=0.023..0.025 rows=1 loops=1) Buffers: shared hit=1 -\u0026gt; Seq Scan on lzlinfo (cost=0.00..3802.73 rows=473 width=71) (actual time=0.018..0.018 rows=1 loops=1) Buffers: shared hit=1 Planning: Buffers: shared hit=2578 Planning Time: 9.605 ms Execution Time: 0.098 ms Planning time dropped from ~1 second to ~10 ms, and planning shared hits dropped from ~1M to ~2K. 
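One caveat, stated as general PostgreSQL behavior rather than something measured in this case: a plain VACUUM removes dead tuples and their index entries, but it only returns space to the operating system when the empty pages sit at the end of the table. The file may therefore stay close to its bloated size even though planning is fast again. A minimal sketch for checking the state after the slot is dropped:

```sql
-- Reclaim dead tuples now instead of waiting for the next autovacuum cycle.
VACUUM (VERBOSE) pg_catalog.pg_statistic;

-- Dead tuples should be near zero afterwards.
SELECT n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_all_tables
WHERE schemaname = 'pg_catalog' AND relname = 'pg_statistic';

-- On-disk size including indexes; this may not shrink after a plain VACUUM.
SELECT pg_size_pretty(pg_total_relation_size('pg_catalog.pg_statistic'));
```

If the size matters, VACUUM FULL on pg_statistic rewrites the table compactly, but it takes an ACCESS EXCLUSIVE lock, so it is best run in a quiet window.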
The problem was basically resolved.\nCase Summary # The replication link broke and the replication slot wasn\u0026rsquo;t cleaned up in time, leading to bloat in the pg_statistic statistics base table. This caused each backend to be very slow when loading statistics for the first time and to read excessive pages into its local cache. Each backend\u0026rsquo;s cache exceeded normal levels (~2GB), and with multiple backends this led to OOM.\nThe problem itself is simple — it was just the investigation that was convoluted. In short: bloat in the base table pg_statistic caused excessive data access during the plan generation phase. Metadata base table bloat can cause other tricky problems too — until next time.\n","date":"Aug 21, 2024","externalUrl":null,"permalink":"/en/2024/08/21/postgresql-case-study-analysis-of-abnormally-long-planning-time/","section":"Posts","summary":"Problem Analysis Overview # The database kept OOMing. Analysis revealed the issue was in query plan generation: planning time ~1 second, planning shared hits ~1 million. After thorough investigation, the root cause was identified as bloat in the statistics base table pg_statistic. On the first SQL execution of a session — due to a CatCacheMiss — the backend accessed and cached an excessive amount of dead tuple data from pg_statistic. Application connections always spawned new sessions, and the combined memory usage across multiple backends was too large, leading to OOM.\n","title":"PostgreSQL Case Study: Analysis of Abnormally Long Planning Time","type":"posts"},{"content":" Why I Read This Book # In the final pages of Elon Musk, the author briefly introduced two books by economist Tyler Cowen: The Great Stagnation and Average Is Over. The Great Stagnation is about why America\u0026rsquo;s development has stalled over the past 40 years — something I\u0026rsquo;m naturally not that interested in. 
But Average Is Over is not a study of history; it\u0026rsquo;s a perspective on future development, especially the impact of AI on human life.\nI\u0026rsquo;ve always been interested in what human life will look like in the future. Recently, OpenAI has been hot, and it feels like the AI era is upon us. What changes will AI bring to our lives and work? Will social structures shift? Which jobs will gradually disappear? Which jobs will benefit?\nChess # The book spends a large portion (nearly half) discussing chess and computer programs. You can tell the author is definitely a chess enthusiast — he\u0026rsquo;s deeply knowledgeable about chess history and its evolution. Reading this section always reminds me of The Queen\u0026rsquo;s Gambit. If it weren\u0026rsquo;t for that show, I wouldn\u0026rsquo;t have known chess had rapid formats or that the Soviet Union was the world\u0026rsquo;s strongest chess nation. The author also uses chess to explore the influence of computer programs on the game.\nThis influence goes beyond AlphaGo defeating the world\u0026rsquo;s strongest human Go player — the \u0026ldquo;beating the brightest human minds\u0026rdquo; kind of impact. It also includes how early chess programs changed the way humans learn chess. In the early days of chess, before computers took off, people could only learn chess from other people. A beginner couldn\u0026rsquo;t often play against a chess master. But as computer programs became widespread, they were adopted en masse. Chess programs could teach you, you could play against them, and you could even set the difficulty level. This was incredibly convenient for beginners. Without us even noticing, computer programs quietly reshaped our lives. In the future, we will increasingly collaborate with AI.\nPolarization # Once AI is widely deployed, many aspects of our lives will change. 
AI is unlikely to overturn the social structure of rich and poor; if anything, the reality that a tiny minority controls the vast majority of wealth may intensify further. The middle class is perhaps the most vulnerable stratum. Many middle-class workers perform partially intellectual but repetitive work — exactly AI\u0026rsquo;s sweet spot. The book argues that the value of middle-class work isn\u0026rsquo;t actually that great and may be relatively easily replaced. Differences in starting capital will amplify differences in wealth accumulation. In this age, that sentence is easy to understand.\nThe book approaches wealth distribution from an American perspective, but it\u0026rsquo;s easy to map onto the Chinese context. China\u0026rsquo;s economic development over recent decades has been truly remarkable, driven by the dividends of population and infrastructure construction, a phase all developed nations went through. But the introduction of market economics and the passage of time have been accompanied by growing wealth inequality. Let\u0026rsquo;s leave it there\u0026hellip; I don\u0026rsquo;t want to write anything too sensitive\u0026hellip;\nThe Rising Cost of Learning # The cost of learning keeps rising. This doesn\u0026rsquo;t refer to the cost of tuition or training courses, but the difficulty of learning or mastering a profession. The word \u0026ldquo;inventor\u0026rdquo; — I suspect many people haven\u0026rsquo;t heard it in a long time. Our impression of the term is still stuck in the Edison era. Back then, individuals could invent things on their own; they just needed some relatively advanced knowledge in their field and a bit of brainpower. \u0026ldquo;Inventing\u0026rdquo; didn\u0026rsquo;t seem that hard. But these days we rarely hear the word \u0026ldquo;inventor\u0026rdquo; anymore. 
It\u0026rsquo;s not that humans have stopped inventing — it\u0026rsquo;s that what people invent now is almost always the work of a team, many people, often requiring cross-disciplinary collaboration among multiple specialists. The cost of \u0026ldquo;inventing\u0026rdquo; things keeps rising because the knowledge required to master a field grows ever larger and more complex. It\u0026rsquo;s unrealistic for one person to master an entire industry; people tend to specialize in narrower domains — and even a narrow domain is enough for a lifetime of study.\nAcademia today faces this exact situation. A relatively successful paper typically requires experts from various fields to use their specialized knowledge to verify the correctness of one small segment of a proof. The book gives a classic example: if a mathematician proves a conjecture in mathematics, there may be only a handful of people in the entire world who can truly understand what the mathematician is proving. Most of them may only understand one section of the content — and even the mathematician themselves may only say: \u0026ldquo;I might be right.\u0026rdquo; We have no way to verify the correctness of the proof.\nHuman knowledge is becoming increasingly complex. Scientists now tend to, and increasingly do, hand calculations and experiments over to machines. Humanity seems to have reached a tipping point: our brains are nearly incapable of understanding this knowledge anymore. From a biological perspective, the human brain necessarily has a limit. The processing speed of the human brain can\u0026rsquo;t remotely keep up with machines.\nSelf-Learning # Even as learning costs rise, education will become ever more important in the future. The education system may change. Since time will be more precious in the future world and learning resources will be easier to access, people will lean toward online learning and self-directed learning. 
At the same time, this makes self-drive even more critical.\nAs an IT professional, I have a deep appreciation for self-learning. This industry is intensely competitive; if you don\u0026rsquo;t keep learning, you\u0026rsquo;re basically on the brink of obsolescence. But highly effective learning is also reflected in your salary. Our parents\u0026rsquo; generation relied on assigned jobs and could work in one position for decades without major changes. People back then just thought about working, not obsessively self-improving and chasing certifications. Times have truly changed. How many people, like me, are still writing articles at 11 PM? I\u0026rsquo;m even baffled by industries where you don\u0026rsquo;t need to keep learning after graduation — just how backward are they? You graduate university in your early twenties and still have decades to learn. It would be utterly strange to just stagnate there. Of course, I don\u0026rsquo;t like cutthroat competition, but I like standing still even less — especially in an age where just sitting on a stool spacing out causes the wealth gap to widen.\nFinally # The cover and illustrations for this post are all AI-generated. I just typed in \u0026ldquo;goodbye age of mediocrity\u0026rdquo; and the AI produced astonishing images. I don\u0026rsquo;t know exactly which industries or professions will disappear in the future, but at the very least, illustrators are going to have a hard time surviving in the AI era.\nAI has already invaded the IT domain. As a DBA, which of our work patterns will be replaced? That\u0026rsquo;s a question worth pondering. Whatever happens, in this age, only learning can keep you competitive. 
I hope none of us will be the \u0026ldquo;disappearing shoulder pole porter.\u0026rdquo;\n","date":"Aug 13, 2024","externalUrl":null,"permalink":"/en/2024/08/13/book-notes-average-is-over/","section":"Posts","summary":"Why I Read This Book # In the final pages of Elon Musk, the author briefly introduced two books by economist Tyler Cowen: The Great Stagnation and Average Is Over. The Great Stagnation is about why America’s development has stalled over the past 40 years — something I’m naturally not that interested in. But Average Is Over is not a study of history; it’s a perspective on future development, especially the impact of AI on human life.\n","title":"Book Notes — Average Is Over","type":"posts"},{"content":"I\u0026rsquo;ve actually wanted to write about these two books for a long time. I love reading, but I absolutely detest writing. Maintaining a blog is practically a miracle for me. I love reading because I love and believe in the power of education. As for how much I hate writing, let me tell you a little story.\nI Hate Writing Essays # My dislike for writing is practically innate. Since elementary school, I never wrote diaries or essays. Every winter and summer break homework required a daily diary entry — I never wrote a single one. I still remember when school started and I had to turn in summer homework. The teacher threatened that if I didn\u0026rsquo;t finish it, I couldn\u0026rsquo;t attend class. I still wrote nothing and just sat in the classroom as usual. Later, for some assignment, our homeroom teacher had to submit examples of correcting typically flawed students — only 4 or 5 kids in the class were selected. One was bad at sports, one had a temper problem\u0026hellip; and I was the one bad at writing! And the task was to write an essay about correcting that flaw! I can\u0026rsquo;t write — why would you make me write an essay about fixing my inability to write??? I dragged it out for two weeks. All the other students turned theirs in. 
I couldn\u0026rsquo;t squeeze out a single word. The homeroom teacher personally coached me. She said: when you walk down the street, you can turn anything you see into a sentence. See a blue sky? You can form a sentence in your mind: \u0026ldquo;The sky is cloudless for miles.\u0026rdquo; Practicing sentence construction regularly can help with writing. You could also think of other ways to fix your writing aversion. Another week passed, and I wrote down exactly what she told me, verbatim. I could see the frustration in her eyes. Later, in middle school, I cleverly befriended the Chinese class representative so she\u0026rsquo;d leave my name off the missing-homework list. That\u0026rsquo;s how I dodged three years of middle school. Then in high school, I once awkwardly wrote an essay my own way and scored 30 out of 60 — a devastating blow. So for every monthly exam, I simply didn\u0026rsquo;t write the essay. I figured: it\u0026rsquo;s just mock exams, not the Gaokao — I\u0026rsquo;ll just forfeit those 60 points. Finally, for the actual Gaokao and the two mock exams before it, I crammed Qu Yuan and Li Bai into essay templates like eight-legged essays. I found there was nothing you couldn\u0026rsquo;t cram them into, and I muddled through the Gaokao essay hurdle. College? No need to mention it — my hand had forgotten how to hold a pen.\nYes, with this peculiar writing psychology, I hated essays. But after entering the workforce, I gradually understood: the dullest pencil is better than the sharpest memory. No matter how many books you read, you need to internalize them. The pressures of ambition, family, and work forced me to change. Whether I needed this skill or not didn\u0026rsquo;t seem to matter — if society needs it, I should try to adapt. Writing not only pushes you forward, it\u0026rsquo;s also a way to record growth, to record life. 
Even my own technical articles — years later, I still have to come back and read them carefully, review them carefully.\nReading Originals Is Good for Body and Mind # I read both Educated and Atomic Habits in their original English. I had a bit of an English foundation, and since I was preparing for graduate school entrance exams at the time, I wanted to improve my English reading — so I chose English originals. At first, reading English originals was quite difficult. Many words were unfamiliar, and I\u0026rsquo;d look them up and annotate them in the book. Progress was painfully slow. But as I read deeper, there were fewer and fewer annotations. It wasn\u0026rsquo;t that I quickly memorized many new words — rather, some important words appear repeatedly throughout a book, while others that appear rarely don\u0026rsquo;t affect comprehension. Also, at the start you don\u0026rsquo;t know what the book is about, so comprehension latency is high. Later, once you know where it\u0026rsquo;s headed, reading naturally speeds up. For instance, the word \u0026ldquo;ridge\u0026rdquo; appeared very frequently early on, and I eventually remembered it. Some similar words I still can\u0026rsquo;t remember, but I know they\u0026rsquo;re some kind of geographic term — summit, valley, ridge — and even without remembering them precisely, it doesn\u0026rsquo;t stop me from reading. That\u0026rsquo;s how English originals work: difficult at first, faster the further you go.\nEducated # The Chinese title of Educated is You Should Fly Like a Bird to Your Mountain — I really want to complain about this title. \u0026ldquo;Educated\u0026rdquo; is the spiritual essence of the entire book, and near the end, the author drives it home with \u0026ldquo;call it educated\u0026rdquo; — absolutely brilliant. This Chinese title is like shit, completely missing the book\u0026rsquo;s essence. 
The author\u0026rsquo;s personal experience is legendary: a child who walked out of some corner of the American mountains, who through sheer effort studied all the way to Cambridge. Her father was uneducated, anti-social, lacking basic physics knowledge — his ignorance led to family members getting injured or even disabled. He disapproved of children going to school, even believing education was government brainwashing. Countless absurd behaviors. Her brother also had personality issues — he shoved her head into a filthy toilet and made her beg for mercy, then the next day acted like nothing happened and continued being her \u0026ldquo;good brother\u0026rdquo;\u0026hellip; Later, the author found her way out through education, and in the end, she didn\u0026rsquo;t want to return to that valley. Reading the ending always reminds me of my own experience. Of course, I didn\u0026rsquo;t have such an extreme environment, nor such a legendary journey, but I feel like I can understand — after being educated, family interactions somehow feel unnatural. It\u0026rsquo;s not about getting cocky after university — the generation gap is real. I deeply believe in the importance of education. If my family hadn\u0026rsquo;t sold everything they had to fully support my education, our circumstances would never have changed. If you\u0026rsquo;ve truly been mired in poverty, you know how fierce the desire to escape it is — and education is almost the only way out for people like us. Educated is a great book: clear prose, comfortable sentence structure, suited to modern reading rhythms, a gripping story, a profound theme. It\u0026rsquo;s an excellent choice as your first English original.\nAtomic Habits # Atomic Habits — I\u0026rsquo;ve forgotten exactly how I found this book, but it changed my understanding of behavior. Building good habits isn\u0026rsquo;t actually that hard; most people just don\u0026rsquo;t know how. 
Many have said: I\u0026rsquo;ll read X books in a few months, run Y kilometers, lose Z pounds — but they rarely follow through. Building good habits requires genuinely liking the habit, changing your mindset, reducing the friction of the action, putting obstacles farther away, forming reward mechanisms, and so on. When you want to become a certain kind of person, don\u0026rsquo;t focus on how to become that person — think about what that kind of person does, and learn to do it. For example, quitting smoking: if your brain thinks you\u0026rsquo;re \u0026ldquo;in the process of quitting,\u0026rdquo; it\u0026rsquo;s very hard. If someone offers you a cigarette and you say \u0026ldquo;I\u0026rsquo;m quitting,\u0026rdquo; a few words from them might get you to smoke. But if you genuinely believe you\u0026rsquo;re someone who \u0026ldquo;doesn\u0026rsquo;t smoke\u0026rdquo; — note, this must be your authentic inner belief — when someone offers you a cigarette, you\u0026rsquo;ll simply say \u0026ldquo;I don\u0026rsquo;t smoke,\u0026rdquo; and you probably won\u0026rsquo;t have to smoke it. Some small details: say you want to build a habit of reading at night — you need to break the habit of scrolling on your phone. Move your books from the bookshelf to your bedside for easier access. Put your phone at the foot of the bed, making getting up the barrier to grabbing the phone — this makes it easier to reach for the book instead of the phone. If picking up the book is still hard, reframe your thinking: \u0026ldquo;reading\u0026rdquo; as an action may feel difficult, but break it down — \u0026ldquo;pick up the book\u0026rdquo; or \u0026ldquo;open to the first page\u0026rdquo; becomes your mental target. The startup action for reading is simple and easy to complete. After reading the first page, think about what comes next — and in reality, once you\u0026rsquo;ve read the first page, it\u0026rsquo;s hard not to read the second. 
Of course, there are many more excellent suggestions for building good habits and shedding bad ones — every word is a gem, thoroughly engaging. After reading Atomic Habits, whenever I want a certain habit, I first consider the book\u0026rsquo;s guidance, then plan how to implement it — rather than acting on impulse.\nFinally # At last — these two books have had an enormous impact on me. One is a legendary autobiography; the other is a behavior-transforming book. Neither is the kind of work you forget shortly after reading. They\u0026rsquo;re perfect starter books for cultivating a reading habit, especially for those wanting to read English originals. I really don\u0026rsquo;t recommend Pride and Prejudice or One Hundred Years of Solitude — yes, they\u0026rsquo;re classics, but their impact on the reader is quite low, and they were written so long ago that some vocabulary and grammar are too archaic, making them unsuitable for first-time English readers. Looking at this through the lens of Atomic Habits: reading these English classics is not only more difficult but also lacks immediate personal benefit, making it hard to form a habit.\n","date":"Aug 13, 2024","externalUrl":null,"permalink":"/en/2024/08/13/book-notes-educated-and-atomic-habits/","section":"Posts","summary":"I’ve actually wanted to write about these two books for a long time. I love reading, but I absolutely detest writing. Maintaining a blog is practically a miracle for me. I love reading because I love and believe in the power of education. As for how much I hate writing, let me tell you a little story.\nI Hate Writing Essays # My dislike for writing is practically innate. Since elementary school, I never wrote diaries or essays. Every winter and summer break homework required a daily diary entry — I never wrote a single one. I still remember when school started and I had to turn in summer homework. The teacher threatened that if I didn’t finish it, I couldn’t attend class. 
I still wrote nothing and just sat in the classroom as usual. Later, for some assignment, our homeroom teacher had to submit examples of correcting typically flawed students — only 4 or 5 kids in the class were selected. One was bad at sports, one had a temper problem… and I was the one bad at writing! And the task was to write an essay about correcting that flaw! I can’t write — why would you make me write an essay about fixing my inability to write??? I dragged it out for two weeks. All the other students turned theirs in. I couldn’t squeeze out a single word. The homeroom teacher personally coached me. She said: when you walk down the street, you can turn anything you see into a sentence. See a blue sky? You can form a sentence in your mind: “The sky is cloudless for miles.” Practicing sentence construction regularly can help with writing. You could also think of other ways to fix your writing aversion. Another week passed, and I wrote down exactly what she told me, verbatim. I could see the frustration in her eyes. Later, in middle school, I cleverly befriended the Chinese class representative so she’d leave my name off the missing-homework list. That’s how I dodged three years of middle school. Then in high school, I once awkwardly wrote an essay my own way and scored 30 out of 60 — a devastating blow. So for every monthly exam, I simply didn’t write the essay. I figured: it’s just mock exams, not the Gaokao — I’ll just forfeit those 60 points. Finally, for the actual Gaokao and the two mock exams before it, I crammed Qu Yuan and Li Bai into essay templates like eight-legged essays. I found there was nothing you couldn’t cram them into, and I muddled through the Gaokao essay hurdle. College? No need to mention it — my hand had forgotten how to hold a pen.\n","title":"Book Notes — Educated and Atomic Habits","type":"posts"},{"content":" Gifted # Musk\u0026rsquo;s ancestors, driven by a love of adventure, emigrated from America to South Africa. 
His maternal grandfather even flew a plane from Africa to Australia. Musk was born in South Africa and showed astonishing memory and brilliance from an early age. His mother, Maye Musk, told his teacher: \u0026ldquo;My son is a genius.\u0026rdquo; The teacher replied, \u0026ldquo;Yes, every mother says that.\u0026rdquo; Maye: \u0026ldquo;No, I mean he really is a genius.\u0026rdquo; As a child, Musk sometimes seemed \u0026ldquo;slow to react.\u0026rdquo; His mother said when people talked to him, he\u0026rsquo;d give no response at all. She thought something was wrong with his brain and even took him to a doctor. But later she discovered Musk was simply immersed in his own world of thought. As a child, Musk could even finish reading the entire library\u0026rsquo;s collection and then ask the library to get more books\u0026hellip;\nArriving in America # Due to the less-than-ideal environment in South Africa, Musk, approaching university age, executed a two-step jump. He first went to university in Canada, then to the United States for his master\u0026rsquo;s. Upon finally reaching America, Musk immersed himself in Silicon Valley\u0026rsquo;s work environment. The tech industry desperately needed young people like him — brilliant and relentless. And Silicon Valley\u0026rsquo;s tech atmosphere and culture of freely exercising one\u0026rsquo;s talents let Musk dive in completely.\nZip2 and PayPal # Soon Musk founded Zip2, essentially a corporate version of online maps. While we\u0026rsquo;re now very familiar with online maps, the US internet industry was just getting started back then — this was all novel stuff. After many twists and turns, Zip2 did grow. Personally, I think Zip2\u0026rsquo;s model would have struggled to survive long-term without pivoting toward online maps or something like Yelp. Eventually, some sucker bought Zip2 for $300 million, instantly turning Musk into a multimillionaire and Silicon Valley tech tycoon. 
You could actually tell — Zip2 was deeply divided internally, had directional problems, and Musk didn\u0026rsquo;t have absolute decision-making power. He probably wanted out long ago.\nBefore leaving Zip2, Musk was already planning and recruiting for online payments. At that time, the world didn\u0026rsquo;t even have anything like Alipay\u0026hellip; Musk believed traditional finance was too conservative and that there was enormous opportunity to change the industry model. But many bankers didn\u0026rsquo;t believe internet finance could work, because internet finance couldn\u0026rsquo;t handle network security issues — after all, the slightest error in finance could have enormous consequences. Initially, the company Musk founded wasn\u0026rsquo;t PayPal but X.com, which later merged with PayPal and kept the latter\u0026rsquo;s name. Early on, X.com suffered massive attacks but survived. Their security mechanisms at the time had a significant influence on the later online payments industry. PayPal was later acquired by eBay, netting Musk hundreds of millions of dollars — another huge payday.\nSpaceX and Tesla # Zip2 and PayPal were, for Musk, validation of his industry sensitivity and business acumen — though some questioned his execution and decision-making abilities, i.e., his CEO chops. As always, Musk viewed these industries as too conservative and old-fashioned. Musk loved recruiting extremely capable top university graduates and disliked hiring seasoned, conservative-minded industry veterans. He ran both companies simultaneously, and for a long time, neither company produced any product at all. And, as you\u0026rsquo;d imagine, rocket-building burns through money like nothing else. After several failed rocket launches, Musk deployed his signature skill: fire\u0026hellip; And just as the financial crisis hit and no one wanted to invest, he poured his entire personal fortune into both companies. 
After several failures, SpaceX\u0026rsquo;s Falcon rocket finally achieved the feat of being the first private company to successfully launch a satellite, landing a $1 billion NASA contract. Tesla, after shamelessly asking early Roadster customers for more money (because developing such a radically new-concept EV cost far more than projected), finally produced a finished vehicle and built out a highway EV charging network and an electric car factory. After simultaneously succeeding with two industry-disrupting companies, no one questioned Musk\u0026rsquo;s ability anymore.\nFor Humanity # Musk\u0026rsquo;s success is inseparable from his excellent qualities: sensitivity to future technology, rapid comprehension of new industries, talent identification, a free and open tech and market environment, long working hours and execution\u0026hellip; But the things people dislike include his ruthlessness toward employees — some loyal, devoted people, fired just like that. As a worker myself, I deeply understand the feeling of giving your all without recognition from the company. Reading this book, I could even feel how American capitalists truly exploit workers. Once, an employee missed a company gathering because he didn\u0026rsquo;t want to miss his daughter\u0026rsquo;s birth. Musk emailed him an angry tirade: do you want to wallow in domestic trivialities or work relentlessly to change the world? The guy just didn\u0026rsquo;t want to miss his daughter\u0026rsquo;s birth.\nA few years ago, reading Steve Jobs, I thought: how could someone be so obsessive? But that exact kind of person changed the mobile industry and brought about the smartphone revolution. Jobs was way too formidable. After reading Elon Musk, I now feel Musk is even stronger than Jobs. Tesla, SpaceX, SolarCity — they\u0026rsquo;re all oriented toward humanity\u0026rsquo;s future. 
The future world seems to have started its engines; you can see it slowly arriving.\nMusk\u0026rsquo;s Mars plan finally seems to have glimpsed some dawn. For decades, the American space industry had nearly stagnated. He brought a new model and once again made aerospace a hot field. But there are also uncertainties. If a crewed launch explodes and causes casualties, SpaceX could plunge back into the abyss. And if Tesla discovers a serious defect requiring a mass recall, the stock price would crash.\nIf you could be the first human to set foot on Mars, would you do it? Musk has thought about it, and he truly could become that person. But Musk wouldn\u0026rsquo;t do it. The book\u0026rsquo;s original words: I want to go, but I don\u0026rsquo;t have to. The point is to enable many people to go to Mars. It would be like the head of Boeing being a test pilot — for space exploration, that\u0026rsquo;s unwise. Even never going to space is fine. The point is to extend the lifespan of humanity as much as possible.\nWorking for humanity — this theme truly stirs the heart. I\u0026rsquo;ve played Civilization VI for days and nights on end, from stick-wielding primitives to igniting rockets, all for that moment of launch, when humanity becomes an interplanetary species and builds a new home on Mars!\n","date":"Aug 13, 2024","externalUrl":null,"permalink":"/en/2024/08/13/book-notes-elon-musk/","section":"Posts","summary":" Gifted # Musk’s ancestors, driven by a love of adventure, emigrated from America to South Africa. His maternal grandfather even flew a plane from Africa to Australia. Musk was born in South Africa and showed astonishing memory and brilliance from an early age. His mother, Maye Musk, told his teacher: “My son is a genius.” The teacher replied, “Yes, every mother says that.” Maye: “No, I mean he really is a genius.” As a child, Musk sometimes seemed “slow to react.” His mother said when people talked to him, he’d give no response at all. 
She thought something was wrong with his brain and even took him to a doctor. But later she discovered Musk was simply immersed in his own world of thought. As a child, Musk could even finish reading the entire library’s collection and then ask the library to get more books…\n","title":"Book Notes — Elon Musk","type":"posts"},{"content":"Rich Dad Poor Dad. I used to scoff at this kind of book. It\u0026rsquo;s the type of success-literature you see displayed at bookstore entrances, looking insubstantial at a glance, very unreliable — the kind of thing that seems to prey on people at the bottom of society who dream of getting rich quick but can never actually apply the book\u0026rsquo;s advice due to their own circumstances or environment. Besides, smart people don\u0026rsquo;t read such uncultured books, right? The title is tacky as hell!\nI spent a period watching Lao Gao and Xiao Mo on Bilibili, and one episode talked about this book, making it sound mystical and mysterious. Plus, it\u0026rsquo;s a global bestseller, so I bought it to see just how magical it really was.\nStarted Reading E-Books # I\u0026rsquo;m a die-hard fan of paper books. I love the feeling of finishing an entire book and placing it on the shelf to collect — that \u0026ldquo;this whole bookshelf is my knowledge\u0026rdquo; feeling. I wasn\u0026rsquo;t really into e-books; they give a \u0026ldquo;read it and it\u0026rsquo;s gone\u0026rdquo; vibe. Three reasons brought me back to e-books:\nI recently deleted all my go-to time-killing apps and needed an app that wasn\u0026rsquo;t so brain-numbing — something I could open first when pulling out my phone. E-books are just more convenient than paper books; you can pull them out and read anytime. Making use of fragmented subway commute time. Back when I was preparing for graduate school entrance exams, I made a detailed daily schedule that included subway time. 
Since I\u0026rsquo;d already built the habit of studying on the subway, I didn\u0026rsquo;t want to give it up. I recommend my summary of the graduate exam experience: How I Got Into Wuhan University\u0026rsquo;s Part-Time Graduate Program\nI slightly adjusted my old plan — no need to grind vocabulary as intensely anymore — so I swapped in e-book reading. I split subway time into two blocks: morning commute and evening commute.\nIn the morning, my mind is clear and my mental state is good, so I read technical e-books. These require slow reading, sometimes stopping to think. This is goal-driven reading.\nIn the evening, my mind is foggy (not really foggy, more often it\u0026rsquo;s a headache), so I read lighter books — like \u0026ldquo;extracurricular\u0026rdquo; books such as Rich Dad Poor Dad. These books aren\u0026rsquo;t hardcore in content, so I read fast and enjoyably, with a bit of a dopamine-driven reading feel.\nPoor Dad and Rich Dad # Author Robert Kiyosaki grew up in Hawaii. His biological father was a highly educated government education official — the \u0026ldquo;Poor Dad.\u0026rdquo; His best friend\u0026rsquo;s father was a high school dropout with extraordinary financial intelligence — the \u0026ldquo;Rich Dad.\u0026rdquo; Poor Dad had higher education but worried daily about loans and bills, while Rich Dad spent every day directing people to create wealth for him. The book has a classic line: \u0026ldquo;The poor work for money; the rich make money work for them.\u0026rdquo;\nThe mindset of making money: As a child, the author and his good friend came up with many ways to earn money. They once used toothpaste tubes to counterfeit coins — not knowing at the time that making money was illegal — and were stopped by adults. Later, they gathered free books from stores, set up a little library in a spot, and earned money by renting books to neighborhood kids. They stopped after attracting some local unsavory characters. 
Rich Dad admired their money-making behavior. He believed the difference between the rich and the poor is that the rich are always thinking about how to make money, while the poor are only thinking about how to find a good job. On taxes: The poor pay far more in taxes than the rich. Rich Dad paid more taxes than Poor Dad, but Rich Dad\u0026rsquo;s income was vastly higher. When the US president decided to raise taxes on the rich, they only raised taxes on the salaried middle class — the truly rich were unaffected. The rich have many ways to legally avoid taxes, by understanding how to use the law. For example, the author says in real estate: if you sell a house, the income is heavily taxed, but if you swap houses, there\u0026rsquo;s no tax. The rich can use this statute to invest in real estate and legally avoid taxes. But the poor can\u0026rsquo;t escape income tax — the more you earn, the more tax you pay. On investing: Investing requires cultivating knowledge in accounting, finance, and law. In short, if you want to make money, you need to develop your financial intelligence. When you earn a big sum, you should start the next investment rather than buying consumer goods. Think of the tables, chairs, jars, bottles, clothes, and household items in our homes — we pay a relatively high price for them, but the moment they\u0026rsquo;re bought, their value drops to near zero. This isn\u0026rsquo;t to say don\u0026rsquo;t buy things — but consider investing your money first, then consider zero-return consumption. Finally # The book says our education system cultivates people\u0026rsquo;s ability to work, not their ability to make money. I strongly agree with this statement, but I still believe in the power of education. The author isn\u0026rsquo;t telling people to skip education — education is also very important for making money. 
We need to understand the basic operating principles and rules of this world, and that can help us find suitable ways to make money.\nGiven the author\u0026rsquo;s family background at the time, they may have been relatively poor compared to Rich Dad, but for truly poor people, their family conditions were far from poor. I feel my own circumstances haven\u0026rsquo;t yet reached the point where I can fully devote myself to investment and money-making. If one day I have some spare money and my management, social, and decision-making skills reach a certain level, I might look for ways to make money. For now, I can\u0026rsquo;t think that far — gotta code well and fill the holes first.\nThe Learning Pyramid in the book left a deep impression on me. You really retain very little from passive learning. That\u0026rsquo;s why I persist in writing frequently, including book notes like these. Problems I encountered during years of late-night database maintenance — I still remember them vividly. Hands-on experience truly creates the deepest memories. That said, hands-on experience is unpredictable and rare; reading and self-learning are the lowest-cost, easiest-to-form habits, and the most cost-effective way to improve ability. They\u0026rsquo;re not that \u0026ldquo;passive.\u0026rdquo; The Learning Pyramid\u0026rsquo;s \u0026ldquo;passive\u0026rdquo; refers to knowledge being received by the subject; \u0026ldquo;active\u0026rdquo; refers to the subject outputting knowledge. This also strengthens my motivation to record and share — whether technical or non-technical.\n","date":"Aug 13, 2024","externalUrl":null,"permalink":"/en/2024/08/13/book-notes-rich-dad-poor-dad/","section":"Posts","summary":"Rich Dad Poor Dad. I used to scoff at this kind of book. 
It’s the type of success-literature you see displayed at bookstore entrances, looking insubstantial at a glance, very unreliable — the kind of thing that seems to prey on people at the bottom of society who dream of getting rich quick but can never actually apply the book’s advice due to their own circumstances or environment. Besides, smart people don’t read such uncultured books, right? The title is tacky as hell!\n","title":"Book Notes — Rich Dad Poor Dad","type":"posts"},{"content":"This is a book I spent a long time reading. It\u0026rsquo;s thick, covers an enormous range of topics, and tackling the original English edition was challenging. But thankfully, I finally finished it — today (February 2023). A real sense of accomplishment.\nSapiens: A Brief History of Humankind is a grand history book that comprehensively introduces the development of human civilization. I\u0026rsquo;ve always enjoyed learning about human history, immersing myself in its weight and the vitality of civilizational progress.\nThe Cognitive Revolution and Fiction # Conventional views of human history hold that humanity\u0026rsquo;s first major evolution or revolution was learning to use tools. Like in 2001: A Space Odyssey, where apes bang bones together as the iconic BGM plays — but that\u0026rsquo;s science fiction. Sapiens argues that humanity\u0026rsquo;s first major revolution was the Cognitive Revolution, the key distinction between humans and animals. Learning to walk upright didn\u0026rsquo;t just free our hands — more importantly, it freed our minds. Four-legged running animals never evolved the way we did because harsh natural environments demanded stronger bodies and limbs for speed. Walking upright obviously makes you slower, so group living and tool use compensated. But group living and tools aren\u0026rsquo;t unique to Sapiens — many animals live in groups, and chimpanzees use tools too. 
What set Sapiens apart was learning to manufacture weapons and boats and to sustain much larger social groups. They walked from Africa to the Middle East, to Europe, battling the physically stronger Neanderthals and ultimately taking their territory. They reached the Far East, crossed the Bering Strait into the Americas, and even sailed to Australia. This ability to craft complex tools and communicate at unprecedented levels — that\u0026rsquo;s what the Cognitive Revolution brought.\nNeanderthals themselves have gone extinct, but recent research shows the vast majority of humans carry a small amount of Neanderthal DNA — except for indigenous Africans. This suggests Neanderthals weren\u0026rsquo;t entirely wiped out by Sapiens; a small number interbred with Sapiens and their genes spread across the world. This is also key evidence supporting the Out-of-Africa theory of human origins.\nThe book gives a classic example of the Cognitive Revolution: imagine a lion by the river. One Sapiens sees it and tells others. The others then construct in their minds the idea that \u0026ldquo;there is a lion by the river\u0026rdquo; — even though they don\u0026rsquo;t know for certain whether one is actually there. The prerequisite is that Sapiens had to learn to conceive of things that aren\u0026rsquo;t immediately present. More importantly, once they mastered this skill, language, fiction, lies, power, social structures followed\u0026hellip; Neanderthals clearly exchanged far less information than Sapiens.\nThe Cognitive Revolution had an enormous impact on civilizational development. It allowed the construction of things that don\u0026rsquo;t actually exist — gods, religions, power, money, social structures, dynasties\u0026hellip; Take a company, for example. A company is really a social construct; it doesn\u0026rsquo;t actually exist in the physical world. A company can be a stack of A4 paper with a stamp in a document bag — but that\u0026rsquo;s just paper. 
Employees believe the company exists because their minds believe it does. Everyone believes it exists, but the company itself is a fiction in human minds — the entity \u0026ldquo;company\u0026rdquo; does not exist in the real world.\nMoney # How did money come about? In a world without money, stable social structures gave rise to barter trade. But as the variety of traded goods increased, the number of equivalent exchange equations grew exponentially. When trading shoes for a rake, it\u0026rsquo;s a simple one-for-one swap. Add a donkey, and you have three exchange equations: shoes-rake, rake-donkey, shoes-donkey. As goods multiply, the number of exchange equations becomes a combinatorial explosion — and that\u0026rsquo;s not even accounting for multi-item trades. Then an intermediary — money — appeared and solved the problem instantly. Everything only needed to be equated with money. Money served as the universal equivalent for all goods, and the convenience of trade improved dramatically. Early forms of money were diverse, with shells being the most common. If shells were too easy to obtain, someone could buy up everything in the market, so shell-based monetary civilizations were typically inland. Since people carried money in their pockets to buy things or hoarded it at home, and worried that too-easy acquisition of currency would disrupt markets, gold — rare, resistant to decay, difficult to mine — became humanity\u0026rsquo;s primary currency for long periods. In ancient Europe, many kings minted gold coins bearing their portraits or logos, resulting in a vast variety of European gold coins. Ancient China was somewhat different: starting with shells unearthed at Sanxingdui, then bronze coins during the Spring and Autumn and Warring States periods, then gold, silver, copper, and paper money (jiaozi) across dynasties. 
China didn\u0026rsquo;t stick to gold like Europe did, mainly because the population was too large and gold reserves too small, making gold too valuable — they needed other metals to create a monetary gradient to smooth trade across different scales.\nThere\u0026rsquo;s another great insight in the book: money and religion both have a certain transmissibility. Money and religion are essentially no different — they are both human constructs, fictions. Their only difference: religion tells you what you should believe, while money tells you what others believe.\nColumbus and Zheng He # The Age of Discovery, the early Industrial Revolution. Europeans were passionate about exploring the world\u0026rsquo;s unknown territories. After Europeans learned the Earth was round, lacking good surveying tools, Columbus set sail westward from Europe aiming for India. They crossed the Atlantic and reached a landmass, encountered the locals, thought they\u0026rsquo;d reached India, and called them \u0026ldquo;Indians.\u0026rdquo; To this day, \u0026ldquo;Indian\u0026rdquo; in the United States carries both meanings. Europeans realized the world still had many corners untouched (at least by relatively modern civilization). They redrew world maps, filling unknown regions with sea monsters and leviathans. These maps are still widely used in video games — for instance, Civilization VI uses sea monster maps for unexplored territory, waiting to be discovered. Europeans eagerly sought new lands, and soon South America, New Zealand, Australia, and countless small islands were discovered and claimed. Where local civilizations were too far behind — the Aztec, Native American, Māori, Tasmanian civilizations — they were brutally massacred, their lands occupied by white settlers.\nWhen the Aztec civilization encountered Spaniards clad in gleaming iron armor and wielding sharp iron swords, they thought those men were gods. 
They couldn\u0026rsquo;t comprehend such hard clothing and weapons — they must have been sent by the gods. And then they were deceived and slaughtered by \u0026ldquo;higher civilization.\u0026rdquo;\nZheng He\u0026rsquo;s ships were called \u0026ldquo;dragon boats\u0026rdquo; (the original text says this, even includes illustrations — I think it may be a mistake, or Westerners assumed any ship with a dragon figurehead was a dragon boat). They were several times larger than Columbus\u0026rsquo;s ships and set sail one to two centuries earlier. Zheng He\u0026rsquo;s fleet, with far superior technology, discovered new lands but didn\u0026rsquo;t occupy them — they traded with the locals. The book argues that Europeans were more adventurous and aggressive, thus ushering in the Age of Discovery. It seems Europeans hold a relatively friendly view of the Ming Dynasty. In Civilization VI, only three Chinese leaders appear: Qin Shi Huang, Wu Zetian, and Zhu Di — and only Zhu Di of the Ming Dynasty is the \u0026ldquo;tall build\u0026rdquo; development-focused leader.\nClosing # Retracing the development of human civilization lets us understand where we came from, what we\u0026rsquo;re doing now, and explore where we\u0026rsquo;re headed. This love for the subject is also why I enjoy strategy games like Civilization VI and Humankind. When you plant rice, domesticate horses, mine salt, iron, coal, oil, uranium\u0026hellip; there\u0026rsquo;s a thrill of human progress.\nI\u0026rsquo;d like to close with a quote from Civilization VI, a game I\u0026rsquo;ve played for over 400 hours: \u0026ldquo;From the first stirrings of life beneath the water\u0026hellip; to the great beasts of the Stone Age\u0026hellip; to man taking his first upright steps, you have come far. 
Now begins your greatest quest.\u0026rdquo;\n","date":"Aug 13, 2024","externalUrl":null,"permalink":"/en/2024/08/13/book-notes-sapiens-a-brief-history-of-humankind/","section":"Posts","summary":"This is a book I spent a long time reading. It’s thick, covers an enormous range of topics, and tackling the original English edition was challenging. But thankfully, I finally finished it — today (February 2023). A real sense of accomplishment.\nSapiens: A Brief History of Humankind is a grand history book that comprehensively introduces the development of human civilization. I’ve always enjoyed learning about human history, immersing myself in its weight and the vitality of civilizational progress.\nThe Cognitive Revolution and Fiction # Conventional views of human history hold that humanity’s first major evolution or revolution was learning to use tools. Like in 2001: A Space Odyssey, where apes bang bones together as the iconic BGM plays — but that’s science fiction. Sapiens argues that humanity’s first major revolution was the Cognitive Revolution, the key distinction between humans and animals. Learning to walk upright didn’t just free our hands — more importantly, it freed our minds. Four-legged running animals never evolved the way we did because harsh natural environments demanded stronger bodies and limbs for speed. Walking upright obviously makes you slower, so group living and tool use compensated. But group living and tools aren’t unique to Sapiens — many animals live in groups, and chimpanzees use tools too. What set Sapiens apart was learning to manufacture weapons, boats, and sustain much larger social groups. They walked from Africa to the Middle East, to Europe, battling the physically stronger Neanderthals and ultimately taking their territory. They reached the Far East, crossed the Bering Strait into the Americas, and even sailed to Australia. 
This ability to craft complex tools and communicate at unprecedented levels — that’s what the Cognitive Revolution brought.\n","title":"Book Notes — Sapiens: A Brief History of Humankind","type":"posts"},{"content":" What is pg_rewind? # pg_rewind is a PostgreSQL-provided tool. When the timelines of two PG instances diverge, pg_rewind can synchronize them. (For example, the primary is running, the standby failover has been running for a while — at this point the primary and standby timelines have diverged.)\npg_rewind compares the sizes of files between the source and target, then copies differing files from source to target, including configuration files. However, it does not compare unchanged files, so pg_rewind runs efficiently on large databases with few changes.\npg_rewind can be used after a standby failover: even if the standby has been running independently for some time, it can be pulled back to the same state as the primary and become a standby again.\nDuring execution, pg_rewind compares the divergence point between primary (source) and standby (target), and transmits the primary\u0026rsquo;s WAL logs after the divergence point to the standby. Therefore, if the primary\u0026rsquo;s WAL after the divergence point is also lost, rewind won\u0026rsquo;t copy nonexistent WAL logs, and the standby will still fail to become a standby. The solution is to use restore.\n!!! When using pg_rewind, back up the target instance. pg_rewind directly overwrites the target database\u0026rsquo;s files. If rewind fails, the target database may be unable to start.\nUsing pg_rewind # After a primary-standby switchover, the old primary continues running, causing timeline inconsistency. 
The old primary cannot start as a standby for the new primary.\nWhen attempting to start the standby, a timeline error appears:\nLOG: entering standby mode FATAL: requested timeline 2 is not a child of this server\u0026#39;s history DETAIL: Latest checkpoint is at 0/6000028 on timeline 1, but in the history of the requested timeline, the server forked off from that timeline at 0/4000098. LOG: startup process (PID 22321) exited with exit code 1 LOG: aborting startup due to startup process failure LOG: database system is shut down At this point, rewind is needed to realign the primary and standby.\nConfigure pg_hba on the current primary Set up login permissions for the pg_rewind user to access the source database. pg_hba.conf changes take effect after a configuration reload (for example pg_ctl reload); a full restart is not required. vi $source/pg_hba.conf host all pg 172.17.100.150/32 trust pg_rewind requires a high-privilege user. From PostgreSQL 11 on, a non-superuser role can be granted the required function privileges; older versions need a superuser. My environment is PG 9.6, so I use the database superuser pg (the role that matches the OS installation user) directly.\nwal_log_hints = on parameter configuration Append wal_log_hints = on to the target database\u0026rsquo;s postgresql.conf, then start and shut down the target database once (at this point the primary is running and the standby is shut down). vi $dest/postgresql.conf wal_log_hints = on Execute pg_rewind [pg@lzl pg96data_sla]$ /pg/pg96/bin/pg_rewind --target-pgdata /pg/pg96data_pri --source-server=\u0026#39;host=172.17.100.150 port=5433 user=pg password=oracle dbname=postgres\u0026#39; servers diverged at WAL position 0/4000098 on timeline 1 rewinding from last common checkpoint at 0/4000028 on timeline 1 Done! Configure standby parameters Modify IP, port, directory, etc. in postgresql.conf and recovery.conf, because pg_rewind also copies the source\u0026rsquo;s configuration files over. 
[pg@lzl pg96data_pri]$ mv recovery.done recovery.conf [pg@lzl pg96data_pri]$ vi recovery.conf [pg@lzl pg96data_pri]$ vi postgresql.conf Start the standby [pg@lzl pg96data_pri]$ /pg/pg96/bin/pg_ctl -D /pg/pg96data_sla -l /pg/pg96data_sla/server.log start server starting [pg@lzl pg96data_sla]$ psql -p5433 postgres psql (9.6.17) postgres=# \\x Expanded display is on. postgres=# select * from pg_stat_replication ; -[ RECORD 1 ]----+------------------------------ pid | 24766 usesysid | 16384 usename | lzl application_name | walreceiver client_addr | 172.17.100.150 client_hostname | client_port | 47345 backend_start | 2021-07-30 07:44:05.582546+00 backend_xmin | state | streaming sent_location | 0/4033790 write_location | 0/4033790 flush_location | 0/4033790 replay_location | 0/4033790 sync_priority | 0 sync_state | async Common Issues # pg_rewind Error 1 # could not fetch remote file \u0026#34;global/pg_control\u0026#34;: ERROR: must be superuser to read files Failure, exiting Solution: Use a high-privilege user.\npostgres=# \\du List of roles Role name | Attributes | Member of -------------+------------------------------------------------------------+----------- lzl | Replication | {} pg | Superuser, Create role, Create DB, Replication, Bypass RLS | {} rewind_user | | {} The pg user is the built-in superuser that comes with the PG server, matching the PG installation user. As a superuser, it is allowed to read server files such as global/pg_control, which is exactly what the failed call needed.\n/pg/pg96/bin/pg_rewind --target-pgdata /pg/pg96data_pri --source-server=\u0026#39;host=172.17.100.150 port=5433 user=pg password=oracle dbname=postgres\u0026#39; pg_rewind Error 2 # could not connect to server: FATAL: no pg_hba.conf entry for host \u0026#34;172.17.100.150\u0026#34;, user \u0026#34;rewind_user\u0026#34;, database \u0026#34;postgres\u0026#34; Failure, exiting No pg_hba.conf entry configured for the connection. 
Solution: Configure pg_hba for the user, e.g.:\nhost all pg 172.17.100.150/32 trust pg_rewind Error 3 # [pg@lzl pg96data_sla]$ /pg/pg96/bin/pg_rewind --target-pgdata /pg/pg96data_pri --source-server=\u0026#39;host=172.17.100.150 port=5433 user=pg password=oracle dbname=postgres\u0026#39; target server needs to use either data checksums or \u0026#34;wal_log_hints = on\u0026#34; Root cause:\npg_rewind requires full_page_writes = on (the default) and, in addition, either wal_log_hints = on or data checksums enabled at initdb time. Solution: Add wal_log_hints = on to the target database\u0026rsquo;s postgresql.conf, then start and shut down the target database once (the target was already shut down, so it must be started and shut down once more for the parameter to take effect).\nvi postgresql.conf # add to target database config wal_log_hints = on Restart the target database to apply:\n[pg@lzl pg96data_sla]$ /pg/pg96/bin/pg_ctl -D /pg/pg96data_pri -l /pg/pg96data_pri/server.log start server starting [pg@lzl pg96data_sla]$ /pg/pg96/bin/pg_ctl -D /pg/pg96data_pri -l /pg/pg96data_pri/server.log stop waiting for server to shut down.... done References # https://www.postgresql.org/docs/9.6/app-pgrewind.html\n","date":"Aug 13, 2024","externalUrl":null,"permalink":"/en/2024/08/13/getting-started-with-pg_rewind/","section":"Posts","summary":"What is pg_rewind? # pg_rewind is a PostgreSQL-provided tool. When the timelines of two PG instances diverge, pg_rewind can synchronize them. (For example, the primary is running, the standby failover has been running for a while — at this point the primary and standby timelines have diverged.)\npg_rewind compares the sizes of files between the source and target, then copies differing files from source to target, including configuration files. 
However, it does not compare unchanged files, so pg_rewind runs efficiently on large databases with few changes.\n","title":"Getting Started with pg_rewind","type":"posts"},{"content":" Why Did I Want to Pursue a Part-Time Master\u0026rsquo;s? # To improve my academic credentials. My undergraduate degree is from an ordinary university. A higher degree can add a bit of competitiveness in my career. I once submitted my resume to a state-owned enterprise and was completely ghosted. But a colleague with better academic credentials in the same office got through. So for state-owned enterprises, higher education is the knock on the door. To make up for failing the graduate entrance exam as a senior and revive the dream of graduate studies. Learning is never wrong — this is my creed. Differences Between Full-Time and Part-Time Graduate Programs # Study Mode # Full-time means you quit your job; part-time allows you to keep working. This basically locks in part-time as the only option for most working people.\nExam Scope # Full-time exams are more demanding, usually covering four subjects: advanced math, graduate English, politics, and a specialized course. Part-time exams are less demanding, covering two subjects: the Management Comprehensive Exam (middle school math, logic, writing) and graduate English. Except for English, which is similar to the full-time version, the management comprehensive exam content is much easier than the full-time track (more on this later).\nResearch Direction # Full-time graduate programs lean toward research, emphasizing learning and research output, cultivating students\u0026rsquo; learning and research abilities.\nPart-time programs lean toward enhancing students\u0026rsquo; management skills, delivering management-oriented talent to society.\nThese two directions are quite different.\nSocial Recognition # Full-time graduate degrees certainly carry more recognition than part-time ones. 
After all, the bar is higher, the study pressure is greater, it\u0026rsquo;s the mainstream path, and society recognizes it more. But part-time degrees do carry recognition too — many schools have explicitly stated they treat both equally (on paper). Most importantly, part-time graduate students hold dual certificates (degree certificate and diploma).\nAs for employment, it depends on the employer. Some positions only require any graduate degree, while others may explicitly state \u0026ldquo;full-time graduate degree required.\u0026rdquo; But for those who can\u0026rsquo;t quit their jobs to pursue higher education, part-time is practically the only path. In summary: Part-time also grants dual certificates, but part-time recognition \u0026lt; full-time recognition.\nHow to Choose a Major # Consider your job nature, career aspirations, and your wallet.\nHR, finance, or corporate executive: MBA\nTechnical roles or engineering management: MEM\nCivil servant or public administration: MPA\nAccounting: MPAcc. There are a few other niche options — search online.\nFrom a financial perspective, tuition varies by school but generally follows similar ranges. Taking Sichuan University as an example: MEM costs about 15,000 yuan per year, MBA about 150,000 yuan per year, MPA roughly similar to MEM.\nFor an IT professional like me, with a thin wallet and not seeing myself as an executive, MEM is the better fit.\nHow to Choose a School # Since the study difficulty is relatively low and social recognition isn\u0026rsquo;t as high as full-time, I recommend choosing a prestigious local university. 211 and 985 universities are strongly recommended — just pick one you like. Many 985 universities set their admission cutoff at the national line, so I personally feel that non-211/985 schools aren\u0026rsquo;t worth applying to. If the scores are the same, why not choose a better school?\nOf course, some 985 universities set their own cutoff lines. 
You\u0026rsquo;ll need to check the school\u0026rsquo;s department website for historical admission scores. For example, Sichuan University sets its own line every year, typically 20–30 points above the national line.\nHow Does the Exam Work? # Exam Content # The exam is divided into the preliminary exam and the re-examination. The preliminary exam is in late December; the re-examination is in March.\nThe preliminary exam is a written test. After registering for the exam and selecting a test venue, you take it in late December — finished in one day, each session 3 hours.\nThe re-examination is an interview. A few schools add a written component, but since the pandemic, it\u0026rsquo;s all been online interviews — rarely do you need to write anything during the interview.\nPreliminary exam content (Management Comprehensive Exam): Re-examination content: Early Interview # The early interview means the school arranges an interview before the preliminary exam — effectively moving the re-examination earlier. Once you pass the early interview, you only need to reach the national line on the preliminary exam. Under the normal process, you\u0026rsquo;d need to exceed the school\u0026rsquo;s own cutoff line.\nEarly interviews are only offered by some schools. For example, Tsinghua has an early-admission interview; Sichuan University doesn\u0026rsquo;t. You\u0026rsquo;ll need to check the official website of your target school.\nIf you pass the early interview, the pressure on the preliminary exam is indeed lighter.\nHow to Register # The two most important websites in the graduate exam process are your target school\u0026rsquo;s official website and the China Graduate Admission Website (研招网, YZW). Before registering, check the master\u0026rsquo;s program catalog for your target school and major. For example, part-time engineering management should be selected as follows: Should You Sign Up for a Training Course? # Many people wonder whether to sign up for a training course. 
Signing up feels too expensive — what if you don\u0026rsquo;t pass? Not signing up means you don\u0026rsquo;t know how to study, or studying feels too exhausting.\nI have some authority on this question, because I did sign up for one.\nI saw a training course online and asked about the price — 8,000 yuan. On top of that, there was an information gap: I didn\u0026rsquo;t know what the exam covered, how to study, how to register, which school to apply to, or where to search for this information. (Searching \u0026ldquo;part-time graduate\u0026rdquo; on Baidu immediately yields nothing but ads.) Plus, I was genuinely determined to study at the time. So I fell into this trap\u0026hellip;\nWhat Did the Training Course Give Me? # First, a pile of study materials — study methods, past exam papers, and so on. Other than the English vocabulary list, which I immediately started memorizing, I barely touched anything else. I printed the past exam papers too, but I never looked at them — not even by the time the exam was over. They looked useful, but in reality, you can search past papers on Taobao and find plenty of officially published versions with detailed explanations — far more useful and less straining on the eyes. And that vocabulary list — strangely, the one the course gave me didn\u0026rsquo;t match the one in Zhang Jian\u0026rsquo;s Yellow Book. I memorized the course\u0026rsquo;s vocabulary for a long time, only to find that some common exam words weren\u0026rsquo;t in the list. Later I switched to Zhang Jian\u0026rsquo;s Yellow Book vocabulary and it felt much better.\nBeyond study materials, the most important component was live-streamed lectures, typically 8–10 PM — two hours of teaching and ten minutes of Q\u0026amp;A.\nThe live lectures were useful, especially logic and math. 
Just listening to those two subjects essentially eliminated the need to buy extra books to thoroughly study the fundamentals of math and logic — you only needed to do the post-class exercises and practice problems. I barely listened to the English classes; I mostly self-studied. Personally, I felt that listening to English lectures was very inefficient and a waste of time — better to memorize more words and do more reading exercises. I only listened to the last two sessions of English writing, which were extremely useful. More on English writing later (with practical tips). Finally, don\u0026rsquo;t fantasize about asking the teacher questions — these online classes have many students, and the Q\u0026amp;A time is only about ten minutes. My questions were never picked.\nBenefits of a training course: Convenience of learning. From a working person\u0026rsquo;s perspective, you\u0026rsquo;re already working overtime a lot. Coming home exhausted, expecting yourself to spread out materials and study like it\u0026rsquo;s the gaokao — too hard. But if it\u0026rsquo;s a lecture, you just sit on the sofa and watch the livestream. That\u0026rsquo;s much easier. Saves time. No need to laboriously make a study plan and constantly adjust it. Listening to lectures is also easier than reading through a thick textbook on your own. Essentially, a training course is trading money for time.\nThe good learning state of classmates motivates you. You\u0026rsquo;re not studying alone with no idea how others are doing.\nThe pitfalls of training courses: Quality varies widely. The institution I signed up with was Shangde. I didn\u0026rsquo;t research them beforehand — they were quite mediocre. Some of their programs are contract-based: they refund if you don\u0026rsquo;t pass, but there were traps in the contract, and no one got refunds. Our group had many people fighting for refunds. Also, personal information leaks — basically every student received refund scam calls. 
Even I, who passed, got five or six such calls.\nTeacher quality is uneven. Some teachers were excellent; others seemed like they were just coasting. Some explanations were outright misleading. In my class, the math and logic teachers were especially good, English was garbage, and writing was misleading\u0026hellip;\nDon\u0026rsquo;t fantasize that a training course will train you into a great candidate. The course is only an aid — it mainly depends on you. From the day I started planning for the exam until the preliminary exam was over, I had basically zero weekends — every one was spent in the library or a café. I declined every social gathering.\nSo, should you sign up for a training course?\nI think if you meet all of the following conditions, you can consider it:\nEnough resolve. Since you\u0026rsquo;ve paid, don\u0026rsquo;t let it go to waste. I also don\u0026rsquo;t recommend refund-based programs that give you an escape route. Enough money. Online courses start at a few thousand yuan — my 8,000 yuan can serve as a reference. In-person courses are more expensive but offer face-to-face tutoring. Unable to bridge the information gap. The information gap may prevent you from planning your own study schedule. If the information gap is what drives you toward a course, I suggest looking at others\u0026rsquo; study plans and successful Bilibili uploaders\u0026rsquo; cases. The biggest source of information is always the school\u0026rsquo;s official website. If you have the time, energy, insufficient funds, or decent learning ability, you absolutely don\u0026rsquo;t need to sign up. In that case, making a study plan that suits you is especially important.\nThe Preliminary Exam # Before the preliminary exam is over, focus only on preparing for the preliminary exam. Generally speaking, you can prepare for the re-examination content after the preliminary exam is done. 
Preparing for the preliminary exam is the core of your studies, the most energy-consuming and competitive phase — this is where success is decided.\nHow to Prepare for the Preliminary Exam — Study Plan # To prepare for the preliminary exam, you need a study plan that fits you, and you need to throw yourself into it completely.\nThe study plan is extremely, extremely, extremely important. You need to first examine yourself — what are your strengths and weaknesses, your circumstances, which subjects you\u0026rsquo;re unfamiliar with, and which ones need long-term study.\nEveryone\u0026rsquo;s situation is different. Let me first share my study plan — you can reference my approach to customizing a plan and my study methods. Because the pressure isn\u0026rsquo;t as high (compared to full-time), I strongly recommend starting in July or August. Starting too late means not enough time; starting too early makes it easy to slack off. Total study time should be 5–6 months. But if your English is really poor, start memorizing words a few months earlier.\nMy personal conditions:\nNot enough time — often worked overtime until 9 PM, weekends generally off. Commute by subway, two hours total both ways. Math almost completely forgotten, never touched logic before, Chinese writing has been terrible since childhood, decent English vocabulary, reading comprehension fine, English writing completely unable. Given the study pressure and my personal conditions, my plan needed to be:\nMemorize words. English is definitely the most time-consuming — it requires sustained, long-term vocabulary memorization. Before all other studying, memorize English II vocabulary. Since morning memory retention is best, I memorized words on the subway to work every day and also on weekend mornings. From August until the preliminary exam. English reading. Actually, once you\u0026rsquo;ve memorized the words, reading comprehension is easy. 
English II doesn\u0026rsquo;t have many long, complex sentences — if you know all the words, reading is no problem. But I personally enjoy English reading, so I scheduled daily reading of English originals. I can\u0026rsquo;t say it had a huge impact, but it wasn\u0026rsquo;t useless — consider it a supplement to exam prep. Importantly, hobbies make habits easier to form. Math and logic have similar study difficulty. Even though I was completely clueless, they were relatively easy to learn (the concepts are simple; the exam itself is another story — but more on that later). I studied math or logic from 8 PM to 10 PM on weekday evenings (mostly attending lectures — if you don\u0026rsquo;t have lectures, buy materials and self-study). This is also long-term study: early phase learning concepts, late phase practicing. Since I couldn\u0026rsquo;t always leave work on time, sometimes I had to use the evening commute and the one-hour lunch break to complete the daily math and logic tasks. (Never fall behind — one day of delay leads to a huge backlog.) Chinese writing. Prepare about one month before the exam — late November or early December. Look at writing materials and try writing yourself. Don\u0026rsquo;t aim for perfection — the main thing is to express the core idea clearly. Trust me, during the exam you\u0026rsquo;ll absolutely be writing in frantic cursive. English writing. Prepare about one month before the exam. Remember: absolutely, absolutely do not memorize model essays. Not only is it brutally hard to memorize them, they\u0026rsquo;re nearly impossible to adapt. Memorize 2–3 templates before the exam, then practice with past exam topics using the templates. You only need to swap in words — no situation where you pick up the pen and can\u0026rsquo;t write a single word. So, my weekly plan: This arrangement felt quite suitable for me — it fully utilized fragmented time and made good use of weekends. 
The first 3–4 months build the foundation: English\u0026rsquo;s foundation is vocabulary, logic and math\u0026rsquo;s foundation is concepts. Weekly study on workdays could be consolidated and practiced on weekends. The final 1–1.5 months are mainly for writing and getting the feel of past papers.\nRecommended Study Materials # English: Zhang Jian\u0026rsquo;s Yellow Book. Just buy the vocabulary book and past exam papers. Baicizhan (vocabulary app), iReading, WeChat public account: 考研英语外刊 (Graduate English Foreign Journals). English writing: I don\u0026rsquo;t recommend any purchasable writing guide. Use universal templates; don\u0026rsquo;t memorize model essays. Math: Chen Jian\u0026rsquo;s Math High Score Guide, past exam answer keys. Logic: Zhonggong\u0026rsquo;s Logic Easy Pass, past exam answer keys. Management comprehensive writing: Buy a popular one — they\u0026rsquo;re all not great. No need to master writing too deeply. Don\u0026rsquo;t buy practice problem books — buy past papers directly. The quality of existing practice problems doesn\u0026rsquo;t compare at all to past papers. For English, no need to buy practice books — directly buy past papers. For math and logic, beyond the built-in exercises that come with foundational study, don\u0026rsquo;t buy extra practice problem books. I did math practice problems for a short period — very time-consuming and ineffective. The key for math and logic is to solidify the fundamentals, cover all the concepts, then do past papers and review the answer explanations. In short, immersive study time (like weekends) should only be spent on past papers. Do the last 20 years\u0026rsquo; worth of papers, then cycle through them again. (20 sets of past papers — only 2 per weekend — takes over two months to finish one round; by the time you redo them, you\u0026rsquo;ve largely forgotten the earlier ones.) 
Always save the most recent two years\u0026rsquo; papers untouched — use them for timed self-testing two weeks before the exam.\nEnglish Study # Vocabulary # Morning vocabulary memorization, rain or shine (it works especially well on the subway\u0026hellip;). Baicizhan — some people like using it. I used it early on too, but I found it ineffective. It covers the entire vocabulary pool; after a year you may not even complete one full cycle, and you\u0026rsquo;ve long forgotten what you studied earlier. So I stopped using it later. I strongly recommend my personal vocabulary method.\nEveryone\u0026rsquo;s vocabulary is different. At the start, you must go through all the words once (graduate exam vocabulary is about 5,000 words) and pull out the ones you don\u0026rsquo;t know onto a vocabulary list. Since carrying a vocabulary notebook on the subway is slightly awkward, I put it on my phone.\nI have a dedicated vocabulary photo album:\nWhen memorizing, open it, zoom in — effectively covering the definitions while memorizing. Cycle through like this. At first I did 2 pages a day, advancing 1 page a day. Later, 4 pages a day, advancing 4 pages a day. No matter what, cycle through — memorize until you can cover the definition and know the word\u0026rsquo;s meaning. For words easily confused, add them to the list, take another photo, and update the album.\nBefore my preliminary exam, I had cycled through these words six or seven times. Basically, aside from beyond-syllabus words, there was nothing I didn\u0026rsquo;t know.\nReading # English total score: 100. Reading ability components = Cloze 10 points + Reading Comprehension 40 points + New Question Type Reading 10 points = 60 points. No matter how poor your English ability, reading comprehension cannot be weak. My reading ability mostly came from daily foreign journal reading, such as iReading and 考研英语外刊. About 20 minutes a day, light study pressure. 
(Actually, it\u0026rsquo;s mainly vocabulary — if you know the words, sentences are easy to understand.)\niReading: \u0026ldquo;Love the World\u0026rdquo; foreign journals, one passage a day. Under Reading Plan → More Collections → Foreign Journals → Love the World, subscribe to the monthly issues, one a day. Relatively easy, good for early-stage reading improvement. WeChat public account: 考研英语外刊, one passage a day. Updated daily. This account is very well done, highly recommended. Just harder — good for later-stage challenge. If you don\u0026rsquo;t fully understand, that\u0026rsquo;s fine; I sometimes couldn\u0026rsquo;t fully grasp it either, since the difficulty is a bit high. Read along with morning vocabulary memorization. If short on time, you can also read on the way home.\nLong and complex sentences: English II doesn\u0026rsquo;t have many. Some people dedicate time specifically to studying them. If you want to study long sentences specifically, I especially recommend Liu Xiaoyan\u0026rsquo;s Long Sentences video series (just search on video sites — they\u0026rsquo;re all free). It\u0026rsquo;s very engaging and well-organized, easy to stick with. As for me, I only watched the simple sentence part of Liu Xiaoyan\u0026rsquo;s course and stopped. Because, first, I found that as long as you know the words, you basically understand the sentences; second, the videos are too long and numerous, taking up too much study time.\nWriting # Writing is divided into short composition (letter or notice, 10 points) and long composition (data analysis essay — bar chart/pie chart analysis, 15 points).\nAgain, do NOT memorize model essays. Before the exam I bought a writing book and memorized 10 model essays — truly, truly excruciating to memorize, and impossible to adapt. 
After memorizing the model essays, the first time I attempted an English writing question, I couldn\u0026rsquo;t write a single word, no exaggeration.\nThe most valuable thing in my training course was the English templates. Using the templates, I worked through all the past years\u0026rsquo; English writing topics, and every single one could be adapted. The number of words to swap in doesn\u0026rsquo;t exceed twenty; you just need to be able to write simple sentences. Here are the templates:\nShort composition template (letter):\nDear Sir or Madam, I am an undergraduate who majors in Applied English in this/a university. I am writing this letter for the purpose of doing sth. 1. It, first and foremost, is my idea that not only ... but also ... 2. Then, more importantly, so ... that ... 3. The last one I must point out is that [simple sentence], which could be accepted by the majority of [people]. So it is the very moment for me to do ..., and I am looking forward to your reply. Yours truly, xxx. Where \u0026ldquo;doing sth\u0026rdquo; includes:\n1. Thank-you letter: expressing my genuine gratitude for your kind help 2. Suggestion letter: making some suggestions concerning sth. 3. Complaint letter: making my complaints concerning sth. 4. Congratulation letter: showing my sincere congratulations to you because [sentence] 5. Apology letter: offering my sincere apology to you because [sentence] 6. Invitation letter: inviting you to participate in [event] on behalf of [person/organization] 7. Notification letter: having [someone] informed that [sentence] The letter template works for all types of letters.\nBesides letters, the short composition may, with low probability, test notices. The notice format differs from letters.\nShort composition template (notice):\nNotice In an effort to do sth, I would like to offer you some detailed information about it. The [event] will be held in the school auditorium at 7 p.m. next Saturday, December 28th, and the requirements for sth. are listed as follows. [Main body: same as the letter template] ... If you have any questions, please feel free to send an email to studentsunion@123.com or call 1234567. We are looking forward to your participation. 
The long composition essentially only involves analyzing bar charts and pie charts. The data falls into two categories: comparing magnitudes and comparing trends. Only the first paragraph differs between the two; the latter two paragraphs are the same.\nLong composition template:\n(First paragraph, comparing magnitudes) The diagram clearly shows/illustrates/illustrated that [sentence/phrase] (the purposes of/attitudes toward/the proportions of) among participants/respondents in a certain college. Based on the data offered, one can distinctly see that [item 1] ranks the first/highest among all the categories, accounting for [figure 1]. Next are [item 2] and [item 3] with [figure 2] and [figure 3] respectively, while [item 4] only constitutes [figure 4]. (First paragraph, comparing trends) The diagram clearly illustrates how [topic] changed during the past several years. Based on the data provided, one can distinctly see that the number of [item 1] rose/fell significantly/slightly/gradually from [figure] in [year] to [figure] in [year], while the number of [item 2] experienced a gradual/significant increase/decrease during the same period, reaching [figure] in [year]. From my standpoint, there are two fundamental factors that are responsible for this scene. To begin with, the first contributing factor is that [sentence]. In addition, another important factor that cannot be ignored is that [sentence]. In view of the analysis above, we can conclude that it is of little surprise to see this phenomenon in the current era. Therefore, it can be predicted that [noun phrase/verb-ing] will still take up a large share in the future. Writing without templates, relying only on your own ability, is extremely difficult. Getting an ultra-high score with templates is hard, but getting 70–80% of the score is no problem, and the upfront investment is basically zero. After memorizing the templates, just write through all the past years\u0026rsquo; writing topics once.\nMath and Logic Study # Math questions: # Logic questions: # Math and logic are both multiple choice, so there is nothing special to say. Early phase: learn concepts. Later phase: improve speed.\nMath and logic have a huge number of concepts. 
The early study phase takes 3–4 months, 2–3 hours a day, to learn all the concepts. After mastering the concepts, practice with past papers and review answer explanations. In the final month, practice with a stopwatch to improve speed: math questions within 70 minutes, logic within 60 minutes.\nChinese Writing Study # The management comprehensive essay is divided into Argument Validity Analysis and Argumentative Essay.\nArgument Validity Analysis is essentially nitpicking — get a writing guide and look through it; it\u0026rsquo;s not hard. You need to find the logical flaws in a lengthy passage of material. When writing, find four problem points. If you haven\u0026rsquo;t studied it, you might struggle to find them; after studying, finding four points is fairly easy. Don\u0026rsquo;t worry about naming the flaws precisely — overgeneralization, equivocation, false dichotomy, etc. Just write \u0026ldquo;xxx does not lead to xxx.\u0026rdquo;\nThe Argumentative Essay mainly involves interpreting a short passage. The key is not to misinterpret the theme. Finding the theme is also challenging at first — look at more sample materials to get a feel for it; generally you can locate the theme. The standard structure is introduction-body-conclusion. I recommend using \u0026ldquo;individual – enterprise – nation\u0026rdquo; as the framework (intro – individual – enterprise – nation – conclusion, 5 paragraphs total). Pick a few tried-and-tested points to plug in. Some students with strong writing skills write argumentative essays using other approaches — I certainly admire that. But writing time is extremely limited. Unless you\u0026rsquo;re naturally gifted with lightning-fast thinking, I recommend using a formulaic approach. Finishing the essay is the top priority.\nExam Time Strategy # Yes, you need to strategize the exam timing too. Trust me 100% — you will not finish the management comprehensive exam. It\u0026rsquo;s the most time-crunched exam I\u0026rsquo;ve ever taken. 
You know you can solve the problems, but you have no time to compute.\nOn exam day: morning — management comprehensive, 3 hours. Afternoon — English, 3 hours.\nEnglish: 3 hours, relatively little content, no need for repeated recalculation — time is completely sufficient. When I finished, I had 50 minutes left and left early.\nManagement comprehensive: 3 hours, absolutely not enough. In my pre-exam self-timed simulations, I consistently took 4 hours. During the real exam, for math — any question over 3 minutes, skip immediately. If it feels computationally heavy, skip immediately. For logic — absolutely cannot use your usual analytical approach. Speed-read the question (logic questions have colossal amounts of text), look at the options, pick whatever feels right. Logic questions requiring computation: temporarily abandon, come back later if time permits. For writing — read the prompt and start writing immediately. Write as fast as you possibly can (both essays combined no more than 1 hour). Every 2 minutes saved could rescue a multiple-choice question.\nYou don\u0026rsquo;t have to finish all the multiple-choice (do fill in the answer sheet completely though), but you MUST finish the writing. So the question order is important. Many people do the essays first, then multiple-choice. I did math first, then essays, then logic. Either way, don\u0026rsquo;t leave writing for the end. In the real exam, both essays must be finished within 55 minutes — 1,500 words total, plus reading the prompt and brainstorming. Try it once and you\u0026rsquo;ll know how impossibly short the time is.\nThe last 20 minutes: fill in the answer sheet. After filling it in, continue solving problems.\nThe Re-examination # Basic Information About the Re-examination # If you\u0026rsquo;ve made it past the national line or the school\u0026rsquo;s own cutoff — congratulations, you\u0026rsquo;ve completed 90% of the journey. The remaining 10% is the re-examination. 
The re-examination has a mandatory elimination rate (required by national policy), typically around 70–80% passing rate. Since elimination must exist, some people get cut every year. If you don\u0026rsquo;t prepare, you\u0026rsquo;re very likely to be among them. Here\u0026rsquo;s a joke: I got cut from Sichuan University\u0026rsquo;s re-examination~\nRe-examination timeframe: mid-to-late March each year.\nScore release: mid-March.\nContent: spoken English, specialized knowledge, comprehensive interview, politics (Sichuan University: open-book politics, no need to prepare. Wuhan University: closed-book written politics\u0026hellip;)\nSince the pandemic, re-examinations have been online interviews — no written test environment. Experts ask questions; you answer.\nSo you have three months to prepare for the re-examination. Conveniently, the preliminary exam ends late December, scores aren\u0026rsquo;t out yet, and February is Chinese New Year — realistically, most people start preparing only when scores are released. Take me as a cautionary example: I received the re-examination notice on March 21, the re-examination was on March 27 — I had six days to prepare, including spoken English and engineering management, which I\u0026rsquo;d never touched before\u0026hellip; So it was embarrassing: I couldn\u0026rsquo;t answer a single one of the examiner\u0026rsquo;s English questions, couldn\u0026rsquo;t answer a single specialized question. Cut from the re-examination.\nPost-Adjustment (Tiaoji) # When I found out I\u0026rsquo;d failed Sichuan University\u0026rsquo;s re-examination, my mood plummeted. But — every cloud has a silver lining. The adjustment (tiaoji) process was my lifesaver. While searching for adjustment schools, I found Wuhan University.\nThe China Graduate Admission Website has a dedicated adjustment window, giving students who failed their initial re-examination three more interview opportunities. 
You can fill in three preferences — three schools to apply to. Since each school has different re-examination dates and requirements, preparing for all of them is very hard. I focused mainly on preparing for Wuhan University\u0026rsquo;s adjustment. The adjustment, of course, also involves a re-examination — essentially, schools that haven\u0026rsquo;t filled their enrollment quotas run the process again, giving students who weren\u0026rsquo;t admitted in the first round another chance.\nAdjustment window: late March to early April.\nHow to Prepare for the Re-examination? # The re-examination is also highly competitive. Lazy people like me are not uncommon\u0026hellip; But no matter what, you\u0026rsquo;ve already invested over half a year — you can\u0026rsquo;t let it go down the drain. (I almost did\u0026hellip;) For non-specialist students like me, the hardest parts of the re-examination are spoken English and specialized knowledge. From score release to the re-examination, you have about one week (while still working!), so learning from scratch is impossible. Based on my experience, the following approaches, in descending order of importance:\nFind seniors who\u0026rsquo;ve been through it and get past re-examination materials (discreetly — sharing re-examination materials externally is prohibited) and course materials. See if anyone you know is at that school, or find groups on forums or Tieba. Search Bilibili for common graduate re-examination questions. Summarize them and memorize. Buy the school\u0026rsquo;s recommended reference books (usually course materials). They\u0026rsquo;re thick; you won\u0026rsquo;t finish them. Finally, the most important thing: mock re-examination. 
Summarize potential English questions, specialized questions, and comprehensive interview questions, then find a partner to act as the examiner for a mock interview.\nThere are other re-examination requirements — keep an eye on department updates and your email: score weightings, interview process, dual-camera setup, interview schedule, document preparation, etc.\nThe End # The 2022 national preliminary exam line was 185. My preliminary score was 210 (English 80, Management Comprehensive 130). Here\u0026rsquo;s my re-examination acceptance notice ^_^ Good luck to all working-student-warriors battered by society but still holding onto your dreams — may your graduate exam go smoothly. You\u0026rsquo;ve got this!!!\n","date":"Aug 13, 2024","externalUrl":null,"permalink":"/en/2024/08/13/how-i-got-into-wuhan-universitys-part-time-masters-program/","section":"Posts","summary":"Why Did I Want to Pursue a Part-Time Master’s? # To improve my academic credentials. My undergraduate degree is from an ordinary university. A higher degree can add a bit of competitiveness in my career. I once submitted my resume to a state-owned enterprise and was completely ghosted. But a colleague with better academic credentials in the same office got through. So for state-owned enterprises, higher education is the knock on the door. To make up for failing the graduate entrance exam as a senior and revive the dream of graduate studies. Learning is never wrong — this is my creed. Differences Between Full-Time and Part-Time Graduate Programs # Study Mode # Full-time means you quit your job; part-time allows you to keep working. 
This basically locks in part-time as the only option for most working people.\n","title":"How I Got Into Wuhan University's Part-Time Master's Program","type":"posts"},{"content":"Source DB: Oracle (11.2.0.4) 192.168.10.141 Target DB: PGSQL (10.12) 192.168.10.128 OGG software version: (19.1.0.0.4) OGG download: Oracle GoldenGate Downloads glibc issue handling: https://www.cnblogs.com/hxlasky/p/16779047.html\n1. Install OGG Software on Source and Target # Source:\nA. Configure response file: oggcore.rsp\noracle.install.responseFileVersion=/home/oracle/oggcore.rsp INSTALL_OPTION=ORA11g SOFTWARE_LOCATION=/oracle/ogg START_MANAGER=false MANAGER_PORT=7809 DATABASE_LOCATION=/oracle/db/11.2.0.4 INVENTORY_LOCATION=/oracle/oraInventory UNIX_GROUP_NAME=oinstall B. Silent install OGG\n./runInstaller -silent -nowait -responseFile /home/oracle/oggcore.rsp oracle@szgtsp431-or@ecsdb\u0026gt;./runInstaller -silent -nowait -responseFile /home/oracle/oggcore.rsp Starting Oracle Universal Installer... Checking Temp space: must be greater than 120 MB. Actual 32405 MB Passed Checking swap space: must be greater than 150 MB. Actual 2048 MB Passed Preparing to launch Oracle Universal Installer from /tmp/OraInstall2020-08-14_08-57-27AM. Please wait ... You can find the log of this install session at: /oracle/oraInventory/logs/installActions2020-08-14_08-57-27AM.log Successfully Setup Software. The installation of Oracle GoldenGate Core was successful. Please check \u0026#39;/oracle/oraInventory/logs/silentInstall2020-08-14_08-57-27AM.log\u0026#39; for more details. 2. Set Database to Archive Mode # oracle@szgtsp431-or@ecsdb\u0026gt;sqlplus / as sysdba SQL*Plus: Release 11.2.0.4.0 Production on Fri Aug 14 09:06:34 2020 Copyright (c) 1982, 2013, Oracle. All rights reserved. 
Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production With the Partitioning, OLAP, Data Mining and Real Application Testing options SQL\u0026gt; archive log list; Database log mode Archive Mode Automatic archival Enabled Archive destination /oracle/oradata/archivelog Oldest online log sequence 19 Next log sequence to archive 21 Current log sequence 21 3. Enable Force Logging and Minimum Supplemental Logging # alter database force logging; alter database add supplemental log data; alter system switch logfile; Verify that force logging and minimum supplemental logging are enabled:\nselect force_logging,supplemental_log_data_min from v$database; 4. Set enable_goldengate_replication Parameter # alter system set enable_goldengate_replication=true scope=both; On RAC, the parameter must be set on all nodes:\nalter system set enable_goldengate_replication=true scope=both sid=\u0026#39;*\u0026#39;; 5. Create OGG User, Tablespace, and Grant Privileges # create tablespace tbs_ogg datafile \u0026#39;/oracle/oradata/datafile/tbs_ogg01.dbf\u0026#39; size 100M; create user goldengate identified by 123456 default tablespace tbs_ogg temporary tablespace temp; grant create session,alter session to goldengate; grant alter system to goldengate; grant resource to goldengate; grant connect to goldengate; grant select any dictionary to goldengate; grant flashback any table to goldengate; grant select any table to goldengate; grant insert any table to goldengate; grant update any table to goldengate; grant delete any table to goldengate; grant select on dba_clusters to goldengate; grant execute on dbms_flashback to goldengate; grant create table to goldengate; grant create sequence to goldengate; grant alter any table to goldengate; grant dba to goldengate; grant lock any table to goldengate; 6. 
Enable Table-Level Supplemental Logging # To sync table data from specific schemas, enable supplemental logging on those tables.\nCheck supplemental logging:\nSELECT owner, table_name, log_group_name, log_group_type, decode(always, \u0026#39;ALWAYS\u0026#39;, \u0026#39;Unconditional\u0026#39;, NULL, \u0026#39;Conditional\u0026#39;) always FROM dba_log_groups ORDER BY owner, table_name, log_group_name; Enable supplemental logging during a low-activity window:\noracle@szgtsp431-or@ecsdb\u0026gt;ggsci Oracle GoldenGate Command Interpreter for Oracle Version 19.1.0.0.4 OGGCORE_19.1.0.0.0_PLATFORMS_191017.1054_FBO Linux, x64, 64bit (optimized), Oracle 11g on Oct 17 2019 23:13:12 Operating system character set identified as US-ASCII. Copyright (C) 1995, 2019, Oracle and/or its affiliates. All rights reserved. GGSCI (szgtsp431-or) 1\u0026gt; dblogin userid goldengate,password 123456 Successfully logged into database. GGSCI (szgtsp431-or as goldengate@ecsdb) 2\u0026gt; add trandata ecs.* 2020-08-14 09:13:54 INFO OGG-15132 Logging of supplemental redo data enabled for table ECS.DEPT. 2020-08-14 09:13:54 INFO OGG-15133 TRANDATA for scheduling columns has been added on table ECS.DEPT. 2020-08-14 09:13:54 INFO OGG-15135 TRANDATA for instantiation CSN has been added on table ECS.DEPT. 2020-08-14 09:13:54 INFO OGG-15132 Logging of supplemental redo data enabled for table ECS.INFO. ... Verify that all supplemental logging was added (use the schema you replicate, here ECS):\nselect * from ( select owner,table_name from dba_tables where owner in (\u0026#39;ECS\u0026#39;) minus select owner,table_name from dba_log_groups) order by owner,table_name; -- no rows selected = all table-level supplemental logging added successfully 7. Configure Manager Process # oracle@szgtsp431-or@ecsdb\u0026gt;ggsci ... GGSCI (szgtsp431-or) 1\u0026gt; dblogin userid goldengate,password 123456 Successfully logged into database. 
GGSCI (szgtsp431-or as goldengate@ecsdb) 2\u0026gt; create subdirs Creating subdirectories under current directory /home/oracle ... GGSCI (szgtsp431-or as goldengate@ecsdb) 3\u0026gt; edit param mgr PORT 7809 DYNAMICPORTLIST 7810-7980 PURGEOLDEXTRACTS ./dirdat/*, USECHECKPOINTS, MINKEEPDAYS 3 PURGEDDLHISTORY MINKEEPDAYS 7, MAXKEEPDAYS 10 LAGREPORTHOURS 1 LAGINFOMINUTES 30 LAGCRITICALMINUTES 45 8. Configure Extract Process # GGSCI (szgtsp431-or as goldengate@ecsdb) 7\u0026gt; add extract extecs, tranlog, threads 1,begin now EXTRACT added. GGSCI (szgtsp431-or as goldengate@ecsdb) 8\u0026gt; add exttrail ./dirdat/lt, extract extecs EXTTRAIL added. GGSCI (szgtsp431-or as goldengate@ecsdb) 9\u0026gt; info all Program Status Group Lag at Chkpt Time Since Chkpt MANAGER RUNNING EXTRACT STOPPED EXTECS 00:00:00 00:00:38 GGSCI (szgtsp431-or as goldengate@ecsdb) 10\u0026gt; edit param extecs EXTRACT extecs SETENV (ORACLE_HOME = \u0026#34;/oracle/db/11.2.0.4\u0026#34;) SETENV (ORACLE_SID = \u0026#34;ecsdb\u0026#34;) USERID goldengate, PASSWORD 123456 EXTTRAIL ./dirdat/lt TRANLOGOPTIONS EXCLUDEUSER goldengate TRANLOGOPTIONS DBLOGREADER DBOPTIONS ALLOWUNUSEDCOLUMN FETCHOPTIONS USESNAPSHOT, USELATESTVERSION, MISSINGROW REPORT STATOPTIONS REPORTFETCH WARNLONGTRANS 1h, CHECKINTERVAL 10m DYNAMICRESOLUTION DISCARDFILE ./dirrpt/extecs.dsc, APPEND, MEGABYTES 1024 DISCARDROLLOVER AT 6:00 REPORTROLLOVER AT 6:00 REPORTCOUNT EVERY 1 MINUTES, RATE DDL INCLUDE MAPPED DDLOPTIONS ADDTRANDATA, REPORT DDLOPTIONS NOCROSSRENAME, REPORT TABLE ECS.*; 9. Configure Pump Process # GGSCI (szgtsp431-or as goldengate@ecsdb) 11\u0026gt; add extract deliecs, exttrailsource ./dirdat/lt EXTRACT added. GGSCI (szgtsp431-or as goldengate@ecsdb) 12\u0026gt; add rmttrail ./dirdat/rt, extract deliecs, megabytes 500 RMTTRAIL added. 
GGSCI (szgtsp431-or as goldengate@ecsdb) 13\u0026gt; edit param deliecs EXTRACT deliecs PASSTHRU DYNAMICRESOLUTION RMTHOST 192.168.10.100, MGRPORT 7809 RMTTRAIL ./dirdat/rt DISCARDFILE ./dirrpt/deliecs.dsc, APPEND, MEGABYTES 1024 DISCARDROLLOVER AT 6:00 REPORTCOUNT EVERY 1 MINUTES, RATE REPORT AT 0:00 REPORT AT 1:00 ... REPORT AT 23:00 REPORTROLLOVER AT 00:00 STATOPTIONS RESETREPORTSTATS TABLE ECS.*; 10. Start Extract Process # GGSCI (szgtsp431-or as goldengate@ecsdb) 20\u0026gt; start extecs Sending START request to MANAGER ... EXTRACT EXTECS starting GGSCI (szgtsp431-or as goldengate@ecsdb) 21\u0026gt; info all Program Status Group Lag at Chkpt Time Since Chkpt MANAGER RUNNING EXTRACT STOPPED DELIECS 00:00:00 00:06:06 EXTRACT RUNNING EXTECS 00:00:00 00:00:01 11. Configure Target OGG Software # A. Upload OGG software and extract B. Configure OGG environment variables\n[pgsql@szgtsp428-or ~]$ vi .bash_profile ## .bash_profile ## Get the aliases and functions if [ -f ~/.bashrc ]; then . ~/.bashrc fi ## User specific environment and startup programs PATH=$PATH:$HOME/bin export PATH export PGHOME=/usr/local/pgsql export PGDATA=/data/pgsql export OGG_HOME=/data/ogg export PATH=$PATH:$PGHOME/bin:$OGG_HOME LD_LIBRARY_PATH=$PGHOME/lib LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/lib:/usr/lib:/usr/local/lib:$OGG_HOME/lib export LD_LIBRARY_PATH export ODBCINI=/home/pgsql/odbc.ini export DD_ODBC_HOME=/data/ogg [pgsql@szgtsp428-or ~]$ ggsci Oracle GoldenGate Command Interpreter for PostgreSQL Version 19.1.0.0.200714 OGGCORE_19.1.0.0.0OGGBP_PLATFORMS_200628.2141 Linux, x64, 64bit (optimized), PostgreSQL on Jun 29 2020 03:59:15 Operating system character set identified as UTF-8. Copyright (C) 1995, 2019, Oracle and/or its affiliates. All rights reserved. GGSCI (szgtsp428-or) 1\u0026gt; 12. 
Create Database and Table on Target # ecsdb=# \\l List of databases Name | Owner | Encoding | Collate | Ctype | Access privileges -----------+----------+----------+-------------+-------------+------------------- ecsdb | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | postgres | pgsql | UTF8 | en_US.UTF-8 | en_US.UTF-8 | template0 | pgsql | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/pgsql + | | | | | pgsql=CTc/pgsql template1 | pgsql | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/pgsql + | | | | | pgsql=CTc/pgsql (4 rows) ecsdb=# \\d List of relations Schema | Name | Type | Owner --------+--------------+-------+---------- public | student_info | table | postgres (1 row) ecsdb=# select * from student_info; id | name | address ----+------+--------- 1 | Zhang San | Guangzhou 2 | Li Si | Shenzhen 3 | Wang Wu | Shanghai 4 | Zhao Liu | Beijing 5 | Sun Qi | Wuhan 6 | A Da | Chengdu 7 | A Er | Nanjing (7 rows) 13. Configure Target Manager Process and Start # [pgsql@szgtsp428-or ogg]$ ggsci Oracle GoldenGate Command Interpreter for PostgreSQL ... GGSCI (szgtsp428-or) 1\u0026gt; info all Program Status Group Lag at Chkpt Time Since Chkpt MANAGER STOPPED GGSCI (szgtsp428-or) 2\u0026gt; create subdirs Creating subdirectories under current directory /data/ogg ... GGSCI (szgtsp428-or) 3\u0026gt; edit param mgr port 7809 GGSCI (szgtsp428-or) 4\u0026gt; info all ... GGSCI (szgtsp428-or) 5\u0026gt; start mgr Manager started. GGSCI (szgtsp428-or) 7\u0026gt; info all Program Status Group Lag at Chkpt Time Since Chkpt MANAGER RUNNING Now start the pump process on source (deliecs):\noracle@szgtsp431-or@ecsdb\u0026gt;ggsci ... GGSCI (szgtsp431-or) 1\u0026gt; info all Program Status Group Lag at Chkpt Time Since Chkpt MANAGER RUNNING EXTRACT ABENDED DELIECS 00:00:00 01:06:41 EXTRACT RUNNING EXTECS 00:00:00 00:00:07 GGSCI (szgtsp431-or) 2\u0026gt; start deliecs Sending START request to MANAGER ... 
EXTRACT DELIECS starting GGSCI (szgtsp431-or) 3\u0026gt; info all Program Status Group Lag at Chkpt Time Since Chkpt MANAGER RUNNING EXTRACT RUNNING DELIECS 00:00:00 01:06:55 EXTRACT RUNNING EXTECS 00:00:00 00:00:01 14. Target PostgreSQL Parameter Adjustment # wal_level = logical #minimal, replica, or logical max_replication_slots = 10 #max number of replication slots max_wal_senders = 10 #maximum number of wal sender processes wal_receiver_status_interval=10s #optional, keep the system default wal_sender_timeout = 60s #optional, keep the system default track_commit_timestamp=off #optional, keep the system default Restart PostgreSQL after adjusting parameters:\n[pgsql@szgtsp428-or pgsql]$ pg_ctl stop -D /data/pgsql/ -l /data/pgsql/logfile waiting for server to shut down.... done server stopped [pgsql@szgtsp428-or pgsql]$ pg_ctl start -D /data/pgsql/ waiting for server to start.... done server started 15. Data Source Configuration (odbc.ini) # [ODBC Data Sources] PGDSN=DataDirect 10.12 PostgreSQL Wire Protocol postgres=DataDirect 10.12 PostgreSQL Wire Protocol scott=DataDirect 10.12 PostgreSQL Wire Protocol [ODBC] IANAAppCodePage=4 InstallDir=/data/ogg [PGDSN] Driver=/data/ogg/lib/GGpsql25.so Description=DataDirect 10.12 PostgreSQL Wire Protocol Database=ecsdb HostName=127.0.0.1 PortNumber=5432 LogonID=postgres Password=123456 16. Connection Test # [pgsql@szgtsp428-or ~]$ cd /data/ogg [pgsql@szgtsp428-or ogg]$ ggsci ... GGSCI (szgtsp428-or) 1\u0026gt; info all Program Status Group Lag at Chkpt Time Since Chkpt MANAGER RUNNING GGSCI (szgtsp428-or) 2\u0026gt; dblogin sourcedb pgdsn userid postgres, password postgres 2020-08-14 11:35:01 INFO OGG-03036 Database character set identified as UTF-8. Locale: en_US.UTF-8. 2020-08-14 11:35:01 INFO OGG-03037 Session character set identified as UTF-8. Successfully logged into database. 17. 
Configure and Start Replicat Process on Target # Add checkpoint table:\nGGSCI (szgtsp428-or) 1\u0026gt; dblogin sourcedb pgdsn userid postgres, password 123456 Successfully logged into database. GGSCI (szgtsp428-or as postgres@pgdsn) 2\u0026gt; add checkpointtable public.chkt Successfully created checkpoint table public.chkt. Configure replicat:\nGGSCI (szgtsp428-or as postgres@pgdsn) 34\u0026gt; edit param repl REPLICAT repl SOURCEDEFS ./dirdef/student_info.def SETENV (PGCLIENTENCODING = \u0026#34;UTF8\u0026#34;) SETENV (ODBCINI=\u0026#34;/home/pgsql/odbc.ini\u0026#34;) SETENV (NLS_LANG=\u0026#34;AMERICAN_AMERICA.AL32UTF8\u0026#34;) targetdb pgdsn userid postgres, password 123456 DISCARDFILE ./dirrpt/repl.dsc, purge MAP ecs.student_info, TARGET public.student_info; GGSCI (szgtsp428-or as postgres@pgdsn) 36\u0026gt; add replicat repl,exttrail ./dirdat/rt,checkpointtable public.chkt REPLICAT added. GGSCI (szgtsp428-or as postgres@pgdsn) 38\u0026gt; start repl Sending START request to MANAGER ... REPLICAT REPL starting GGSCI (szgtsp428-or as postgres@pgdsn) 55\u0026gt; info all Program Status Group Lag at Chkpt Time Since Chkpt MANAGER RUNNING REPLICAT RUNNING REPL 00:00:00 00:00:08 18. Test Verification # First, create matching table structure on target:\ncreate table student_info (id int primary key, name varchar(100), address varchar(100)); Then initialize data:\nConfigure extinit process on source:\nGGSCI (szgtsp431-or as goldengate@ecsdb) 17\u0026gt; edit param extinit EXTRACT extinit userid goldengate, PASSWORD 123456 REPORTCOUNT EVERY 30 MINUTES, RATE DISCARDFILE ./dirrpt/extinit.dsc, APPEND, MEGABYTES 1024 RMTHOST 192.168.10.100,MGRPORT 7809, compress RMTTASK replicat,GROUP replinit TABLE ecs.student_info; GGSCI (szgtsp431-or as goldengate@ecsdb) 18\u0026gt; ADD EXTRACT extinit, SOURCEISTABLE EXTRACT added. 
Configure replinit process on target:\nGGSCI (szgtsp428-or as postgres@pgdsn) 28\u0026gt; edit param replinit REPLICAT replinit targetDB pgdsn, USERID postgres, PASSWORD 123456 discardfile ./dirrpt/replinit.dsc, PURGE SOURCEDEFS ./dirdef/student_info.def Map ecs.student_info,target public.student_info; GGSCI (szgtsp428-or as postgres@pgdsn) 29\u0026gt; add replicat replinit, SPECIALRUN REPLICAT added. Start Oracle-to-PG data initialization:\nGGSCI (szgtsp431-or as goldengate@ecsdb) 9\u0026gt; start extinit Sending START request to MANAGER ... EXTRACT EXTINIT starting Target: (view initialization row count via View report replicat)\nCheck both sides:\nSource (Oracle):\nSQL\u0026gt; select * from student_info; ID NAME ADDRESS ---------- ---------- ---------- 1 Zhang San Guangzhou 2 Li Si Shenzhen 3 Wang Wu Shanghai 4 Zhao Liu Beijing 5 Sun Qi Wuhan 6 A Da Chengdu 7 A Er Nanjing 8 A San Beijing 8 rows selected. Target (PostgreSQL):\necsdb=# select * from student_info; id | name | address ----+------+--------- 1 | Zhang San | Guangzhou 2 | Li Si | Shenzhen 3 | Wang Wu | Shanghai 4 | Zhao Liu | Beijing 5 | Sun Qi | Wuhan 6 | A Da | Chengdu 7 | A Er | Nanjing 8 | A San | Beijing (8 rows) Insert data on source:\nSQL\u0026gt; insert into ecs.student_info values (10,\u0026#39;aa\u0026#39;,\u0026#39;bb\u0026#39;); 1 row created. SQL\u0026gt; commit; Commit complete. Check the target: the data synchronized successfully.\nOriginal link: https://lastdba.com/2024/08/13/ogg搭建oracle-pg同步实操步骤/\n","date":"Aug 13, 2024","externalUrl":null,"permalink":"/en/2024/08/13/ogg-oracle-to-postgresql-sync-hands-on-steps/","section":"Posts","summary":"Source DB: Oracle (11.2.0.4) 192.168.10.141 Target DB: PGSQL (10.12) 192.168.10.128 OGG software version: (19.1.0.0.4) OGG download: Oracle GoldenGate Downloads glibc issue handling: https://www.cnblogs.com/hxlasky/p/16779047.html\n1. Install OGG Software on Source and Target # Source:\nA. 
Configure response file: oggcore.rsp\noracle.install.responseFileVersion=/home/oracle/oggcore.rsp INSTALL_OPTION=ORA11g SOFTWARE_LOCATION=/oracle/ogg START_MANAGER=false MANAGER_PORT=7809 DATABASE_LOCATION=/oracle/db/11.2.0.4 INVENTORY_LOCATION=/oracle/oraInventory UNIX_GROUP_NAME=oinstall B. Silent install OGG\n./runInstaller -silent -nowait -responseFile /home/oracle/oggcore.rsp oracle@szgtsp431-or@ecsdb\u003e./runInstaller -silent -nowait -responseFile /home/oracle/oggcore.rsp Starting Oracle Universal Installer... Checking Temp space: must be greater than 120 MB. Actual 32405 MB Passed Checking swap space: must be greater than 150 MB. Actual 2048 MB Passed Preparing to launch Oracle Universal Installer from /tmp/OraInstall2020-08-14_08-57-27AM. Please wait ... You can find the log of this install session at: /oracle/oraInventory/logs/installActions2020-08-14_08-57-27AM.log Successfully Setup Software. The installation of Oracle GoldenGate Core was successful. Please check '/oracle/oraInventory/logs/silentInstall2020-08-14_08-57-27AM.log' for more details. 2. Set Database to Archive Mode # oracle@szgtsp431-or@ecsdb\u003esqlplus / as sysdba SQL*Plus: Release 11.2.0.4.0 Production on Fri Aug 14 09:06:34 2020 Copyright (c) 1982, 2013, Oracle. All rights reserved. Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production With the Partitioning, OLAP, Data Mining and Real Application Testing options SQL\u003e archive log list; Database log mode Archive Mode Automatic archival Enabled Archive destination /oracle/oradata/archivelog Oldest online log sequence 19 Next log sequence to archive 21 Current log sequence 21 3. 
Enable Force Logging and Minimum Supplemental Logging # alter database force logging; alter database add supplemental log data; alter system switch logfile; Verify force logging and minimum supplemental logging enabled:\n","title":"OGG Oracle-to-PostgreSQL Sync — Hands-On Steps","type":"posts"},{"content":"OGG software version: (19.1.0.0.4) Oracle version: 11.2.0.4 PG version: pg10 OGG download: https://www.oracle.com/technetwork/middleware/goldengate/downloads/index.html\nglibc issue handling: https://www.cnblogs.com/hxlasky/p/16779047.html\n1. Create Database and Table on Source # [root@node2 ~]# su - postgres Last login: Tue Jul 21 21:08:52 CST 2020 on pts/0 [postgres@node2 ~]$ pg_ctl -D /opt/pgsql_data -l logfile start waiting for server to start.... done server started postgres=# create database test; postgres=# \\c test test=# create table tab1(id int primary key,name varchar(20)); 2. Create Database and Table on Target # sqlplus / as sysdba SQL\u0026gt; create table ORALZL.tab1(id number primary key,name varchar2(20)); 3. Extract and Install OGG for PostgreSQL # -- Unlike OGG for Oracle, OGG for PG only needs extraction. Oracle version requires running runInstaller. [postgres@node1 ~]$ id postgres uid=54323(postgres) gid=54330(postgres) groups=54330(postgres) [postgres@node1 ~]$ exit logout [root@node1 ~]# mkdir /ogg [root@node1 ~]# chown -R postgres /ogg [root@node1 ~]# chmod -R 755 /ogg [root@node1 ~]# [root@node1 soft]# ls -l total 240744 -rw-r--r--. 
1 root root 87028695 Jul 22 02:51 19100200714_ggs_Linux_x64_PostgreSQL_64bit.zip [root@node1 soft]# chmod 777 19100200714_ggs_Linux_x64_PostgreSQL_64bit.zip [root@node1 soft]# unzip 19100200714_ggs_Linux_x64_PostgreSQL_64bit.zip Archive: 19100200714_ggs_Linux_x64_PostgreSQL_64bit.zip inflating: ggs_Linux_x64_PostgreSQL_64bit.tar inflating: OGG-19.1.0.0-README.txt inflating: release-notes-oracle-goldengate_19.1.0.200714.pdf [root@node1 soft]# chmod 777 ggs_Linux_x64_PostgreSQL_64bit.tar [root@node1 soft]# su - postgres [postgres@node1 ~]$ cd /soft [postgres@node1 soft]$ tar -xf ggs_Linux_x64_PostgreSQL_64bit.tar -C /ogg 4. Configure PG User Environment Variables # Source PG:\n[postgres@node1 ~]$ cat .bash_profile ## .bash_profile ## Get the aliases and functions if [ -f ~/.bashrc ]; then . ~/.bashrc fi ## User specific environment and startup programs PATH=$PATH:$HOME/.local/bin:$HOME/bin export GGHOME=/ogg export PG_DATA=/opt/pgsql/pgsql/bin export PATH=$PG_DATA:$PATH export PG_HOME=/opt/pgsql/pgsql export LD_LIBRARY_PATH=$PG_HOME/lib:$LD_LIBRARY_PATH:$GGHOME/lib export ODBCINI=/home/postgres/odbc.ini export DD_ODBC_HOME=/ogg export PATH [postgres@node1 ~]$ source .bash_profile 5. Configure Manager Process # [postgres@node1 ~]$ cd /ogg [postgres@node1 ogg]$ ./ggsci Oracle GoldenGate Command Interpreter for PostgreSQL Version 19.1.0.0.200714 OGGCORE_19.1.0.0.0OGGBP_PLATFORMS_200628.2141 Linux, x64, 64bit (optimized), PostgreSQL on Jun 29 2020 03:59:15 Operating system character set identified as UTF-8. Copyright (C) 1995, 2019, Oracle and/or its affiliates. All rights reserved. GGSCI (node1) 2\u0026gt; info all Program Status Group Lag at Chkpt Time Since Chkpt MANAGER STOPPED GGSCI (node1) 3\u0026gt; create subdirs Creating subdirectories under current directory /ogg Parameter file /ogg/dirprm: created. Report file /ogg/dirrpt: created. Checkpoint file /ogg/dirchk: created. Process status files /ogg/dirpcs: created. SQL script files /ogg/dirsql: created. 
Database definitions files /ogg/dirdef: created. Extract data files /ogg/dirdat: created. Temporary files /ogg/dirtmp: created. Credential store files /ogg/dircrd: created. Masterkey wallet files /ogg/dirwlt: created. Dump files /ogg/dirdmp: created. GGSCI (node1) 4\u0026gt; edit params mgr GGSCI (node1) 5\u0026gt; view params mgr port 7809 GGSCI (node1) 6\u0026gt; start mgr Manager started. GGSCI (node1) 7\u0026gt; info all Program Status Group Lag at Chkpt Time Since Chkpt MANAGER RUNNING 6. Adjust Source PostgreSQL Parameters # [postgres@node1 ogg]$ vi /opt/pgsql_data/postgresql.conf wal_level = logical #minimal, replica, or logical max_replication_slots = 10 #max number of replication slots max_wal_senders = 10 #maximum number of wal sender processes wal_receiver_status_interval=10s #optional, keep the system default wal_sender_timeout = 60s #optional, keep the system default track_commit_timestamp=off #optional, keep the system default Restart source PostgreSQL after adjustment:\n[postgres@node1 ogg]$ pg_ctl -D /opt/pgsql_data -l logfile stop [postgres@node1 ogg]$ pg_ctl -D /opt/pgsql_data -l logfile start 7. Configure OGG for PG Data Source # cd /home/postgres/ vi odbc.ini [ODBC Data Sources] PGDSN=DataDirect 7.1 PostgreSQL Wire Protocol postgres=DataDirect 7.1 PostgreSQL Wire Protocol scott=DataDirect 7.1 PostgreSQL Wire Protocol [ODBC] IANAAppCodePage=4 InstallDir=/ogg [PGDSN] Driver=/ogg/lib/GGpsql25.so Description=DataDirect 7.1 PostgreSQL Wire Protocol Database=test HostName=192.168.1.112 PortNumber=5432 LogonID=postgres Password=postgres 8. Connection Test # [postgres@node1 ~]$ cd /ogg [postgres@node1 ogg]$ ./ggsci --dblogin sourcedb pgdsn userid pg, password 123456 GGSCI (node1) 1\u0026gt; dblogin sourcedb pgdsn userid postgres, password postgres 2020-07-22 03:10:44 INFO OGG-03036 Database character set identified as UTF-8. Locale: en_US.UTF-8. 
2020-07-22 03:10:44 INFO OGG-03037 Session character set identified as UTF-8. Successfully logged into database. GGSCI (node1 as postgres@pgdsn) 2\u0026gt; 9. Enable Table-Level Supplemental Logging # Source:\nGGSCI (node1) 3\u0026gt; dblogin sourcedb pgdsn userid postgres, password postgres 2020-07-22 03:21:01 INFO OGG-03036 Database character set identified as UTF-8. Locale: en_US.UTF-8. 2020-07-22 03:21:01 INFO OGG-03037 Session character set identified as UTF-8. Successfully logged into database. GGSCI (node1 as postgres@pgdsn) 4\u0026gt; add trandata public.tab1 --If table has primary key, this step can be skipped Logging of supplemental log data is enabled for table public.tab1. REPLICA IDENTITY was DEFAULT and is changed to FULL GGSCI (node1 as postgres@pgdsn) 5\u0026gt; GGSCI (node1 as postgres@pgdsn) 5\u0026gt; info trandata public.tab1 Logging of supplemental log data is enabled for table public.t1 with REPLICA IDENTITY set to FULL 10. Register Extract Process on PG # Registering an extract process on PG essentially creates a replication slot. The output plugin defaults to test_decoding.\nGGSCI (node1 as postgres@pgdsn) 6\u0026gt; Register Extract ext_pg 2020-07-22 03:25:27 INFO OGG-25355 Successfully created replication slot \u0026#39;ext_pg_2947c06e0ea2ec74\u0026#39; for EXTRACT group \u0026#39;EXT_PG\u0026#39; in database \u0026#39;test\u0026#39;. 11. Configure Extract and Pump Processes # Configure extract process:\nedit param ext_pg SETENV ( PGCLIENTENCODING = \u0026#34;UTF8\u0026#34; ) SETENV (NLS_LANG=\u0026#34;AMERICAN_AMERICA.AL32UTF8\u0026#34;) extract ext_pg SETENV (ODBCINI=\u0026#34;/home/pg/odbc.ini\u0026#34; ) SOURCEDB pgdsn, USERID pg, PASSWORD 123456 exttrail ./dirdat/st TABLE PUBLIC.TAB1; ----GETTRUNCATES ### This feature on PostgreSQL 10.12: ERROR OGG-25541 GETTRUNCATES is not valid. PostgreSQL supports TRUNCATE capture from version 11. 
Note: PG to Oracle cannot sync TRUNCATE commands.\nConfigure pump process:\nextract pump_pg SETENV (ODBCINI=\u0026#34;/home/pg/odbc.ini\u0026#34; ) RMTHOST 172.17.100.150, MGRPORT 7809, compress numfiles 10000 RMTTRAIL ./dirdat/rt TABLE PUBLIC.TAB1; 12. Add Trail and Start Extract/Pump # ADD extract ext_pg, TRANLOG,BEGIN now add exttrail ./dirdat/st,extract ext_pg,megabytes 500 add extract pump_pg,exttrailsource ./dirdat/st add rmttrail ./dirdat/rt,extract pump_pg,megabytes 500 start ext_pg start pump_pg 13. Configure defgen # If table structures are consistent, you can configure ASSUMETARGETDEFS.\nedit param defgen DEFSFILE ./dirdef/tab1.def, PURGE SOURCEDB pgdsn, USERID pg, PASSWORD 123456 TABLE PUBLIC.tab1; Generate table definition file:\ndefgen paramfile /oggpg/dirdef/tab1.prm Copy the defgen file to the target\u0026rsquo;s dirdef directory.\n14. Verify Trail Delivery on Target # [oracle@lzl dirdat]$ cd dirdat [oracle@lzl dirdat]$ ll -rw-r----- 1 pg pg 1439 Feb 28 11:02 rt000000000 15. Register Extract Process on PG # Registering an extract process on PG creates a replication slot:\nGGSCI (node1 as postgres@pgdsn) 6\u0026gt; Register Extract ext_pg 2020-07-22 03:25:27 INFO OGG-25355 Successfully created replication slot \u0026#39;ext_pg_2947c06e0ea2ec74\u0026#39; for EXTRACT group \u0026#39;EXT_PG\u0026#39; in database \u0026#39;test\u0026#39;. 16. Configure Oracle User Environment Variables # export ORACLE_BASE=/oracle/app/oracle export ORACLE_HOME=$ORACLE_BASE/product/11.2.0/dbhome_1 export ORACLE_SID=oralzl export OGG_HOME=/oggfororacle export PATH=$ORACLE_HOME/bin:$ORACLE_HOME/OPatch:$PATH export TNS_ADMIN=$ORACLE_HOME/network/admin export LD_LIBRARY_PATH=$ORACLE_HOME/lib:$OGG_HOME:$ORACLE_HOME/lib32:/lib/usr/lib:/usr/local/lib 17. Configure Oracle Listener and TNS # OGG for Oracle defaults to using TNS_ADMIN\u0026rsquo;s tns. You can also manually configure during extract configuration, e.g.: USERID goldengate@127.0.0.1:1521/oralzl, PASSWORD 123456\n18. 
Install OGG for Oracle on Target # Download OGG software. Configure oggcore.rsp file:\noracle.install.responseFileVersion=/home/oracle/oggcore.rsp INSTALL_OPTION=ORA11g SOFTWARE_LOCATION=/ogg START_MANAGER=false MANAGER_PORT=7809 DATABASE_LOCATION=/oracle/db/11.2.0.4 INVENTORY_LOCATION=/oracle/oraInventory UNIX_GROUP_NAME=oinstall Silent install OGG:\n./runInstaller -silent -nowait -responseFile /home/oracle/oggcore.rsp 19. Oracle Database User and Privileges # create user goldengate identified by \u0026#34;123456\u0026#34;; grant create session,alter session to goldengate; grant alter system to goldengate; grant resource to goldengate; grant connect to goldengate; grant select any dictionary to goldengate; grant flashback any table to goldengate; grant select any table to goldengate; grant select any table to goldengate; grant insert any table to goldengate; grant update any table to goldengate; grant delete any table to goldengate; grant select on dba_clusters to goldengate; grant execute on dbms_flashback to goldengate; grant create table to goldengate; grant create sequence to goldengate; grant alter any table to goldengate; grant dba to goldengate; grant lock any table to goldengate; 20. Target Manager Process # edit param mgr PORT 7809 DYNAMICPORTLIST 7810-7980 PURGEOLDEXTRACTS ./dirdat/*, USECHECKPOINTS, MINKEEPDAYS 3 PURGEDDLHISTORY MINKEEPDAYS 7, MAXKEEPDAYS 10 LAGREPORTHOURS 1 LAGINFOMINUTES 30 LAGCRITICALMINUTES 45 start mgr 21. Configure Replicat Process on Target # GGSCI (node2) 8\u0026gt; dblogin userid goldengate@127.0.0.1:1521/oralzl,password 123456 GGSCI (node2 as postgres@pgdsn) 9\u0026gt; add checkpointtable goldengate.chkt Successfully created checkpoint table public.chkt. Replicat process:\nedit param rep_pg REPLICAT rep_pg USERID goldengate@127.0.0.1:1521/oralzl, PASSWORD 123456 SOURCEDEFS ./dirdef/tab1.def MAP public.tab1, TARGET oralzl.tab1; add replicat rep_pg,exttrail ./dirdat/rt,checkpointtable goldengate.chkt start rep_pg 22. 
Test Sync # [postgres@node1 ~]$ psql postgres=# \\c lzldb test=# \\d tab1; ​ Table \u0026#34;public.tab1\u0026#34; Column | Type | Collation | Nullable | Default --------+-----------------------+-----------+----------+--------- id | integer | | not null | name | character varying(20) | | | Indexes: \u0026#34;t1_pkey\u0026#34; PRIMARY KEY, btree (id) lzldb=# insert into t2 values(1,\u0026#39;lzl1\u0026#39;) ; INSERT 0 1 lzldb=# select * from t2; id | name ----+------ 1 | lzl1 [postgres@node2 ~]$sqlplus / as sysdba SQL\u0026gt; select * from oralzl.tab1; ​ id name ---------- ---------- ​ 1 lzl1 Original link: https://lastdba.com/2024/08/13/ogg搭建pg-oracle同步实操步骤/\n","date":"Aug 13, 2024","externalUrl":null,"permalink":"/en/2024/08/13/ogg-postgresql-to-oracle-sync-hands-on-steps/","section":"Posts","summary":"OGG software version: (19.1.0.0.4) Oracle version: 11.2.0.4 PG version: pg10 OGG download: https://www.oracle.com/technetwork/middleware/goldengate/downloads/index.html\nglibc issue handling: https://www.cnblogs.com/hxlasky/p/16779047.html\n1. Create Database and Table on Source # [root@node2 ~]# su - postgres Last login: Tue Jul 21 21:08:52 CST 2020 on pts/0 [postgres@node2 ~]$ pg_ctl -D /opt/pgsql_data -l logfile start waiting for server to start.... done server started postgres=# create database test postgres=# \\c lzldb postgres=# create table tab1(id int primary key,name varchar(20)) 2. Create Database and Table on Target # sqlplus / as sysdba SQL\u003e create table ORALZL.tab1(id number primary key,name varchar2(20)); 3. Extract and Install OGG for PostgreSQL # -- Unlike OGG for Oracle, OGG for PG only needs extraction. Oracle version requires running runInstaller. [postgres@node1 ~]$ id postgres uid=54323(postgres) gid=54330(postgres) groups=54330(postgres) [postgres@node1 ~]$ exit logout [root@node1 ~]# mkdir /ogg [root@node1 ~]# chown -R postgres /ogg [root@node1 ~]# chmod -R 755 /ogg [root@node1 ~]# [root@node1 soft]# ls -l total 240744 -rw-r--r--. 
1 root root 87028695 Jul 22 02:51 19100200714_ggs_Linux_x64_PostgreSQL_64bit.zip [root@node1 soft]# chmod 777 19100200714_ggs_Linux_x64_PostgreSQL_64bit.zip [root@node1 soft]# unzip 19100200714_ggs_Linux_x64_PostgreSQL_64bit.zip Archive: 19100200714_ggs_Linux_x64_PostgreSQL_64bit.zip inflating: ggs_Linux_x64_PostgreSQL_64bit.tar inflating: OGG-19.1.0.0-README.txt inflating: release-notes-oracle-goldengate_19.1.0.200714.pdf [root@node1 soft]# chmod 777 ggs_Linux_x64_PostgreSQL_64bit.tar [root@node1 soft]# su - postgres [postgres@node1 ~]$ cd /soft [postgres@node1 soft]$ tar -xf ggs_Linux_x64_PostgreSQL_64bit.tar -C /ogg 4. Configure PG User Environment Variables # Source PG:\n","title":"OGG PostgreSQL-to-Oracle Sync — Hands-On Steps","type":"posts"},{"content":" What is Logical Replication # PostgreSQL logical replication is based on logical decoding, which parses WAL log streams into a specified format for output. The subscriber node receives the parsed data and applies it.\nLogical replication differs from streaming replication (physical replication) which is based on instance-level primary-standby where the physical structures are identical. Logical replication can selectively replicate at the table level. Logical Replication in official documentation specifically refers to the \u0026ldquo;publish-subscribe\u0026rdquo; model. In fact, many tools can use logical decoding for heterogeneous database data synchronization.\npg9.4\u0026rsquo;s pglogical plugin can support logical replication (https://github.com/2ndQuadrant/pglogical), and pg10 onwards natively supports logical replication.\nLogical replication can be used for database upgrades, heterogeneous data migration, table-level data synchronization links, subscribing to data streams, etc.\nLogical Decoding # Logical decoding can parse table data changes in WAL logs into row data streams or SQL text. These row data streams or SQL text can be consumed by other types of databases or software. 
The specific parsing format is determined by the output plugin.\nReplication Slots # In logical replication, a replication slot represents a data change stream. Like physical replication slots, logical replication slots ensure that after an abnormal replication interruption the related WAL logs are not deleted, so that WAL parsing can continue once replication reconnects. A database can have multiple replication slots. Each replication slot has exactly one output plugin, and each replication slot represents one replication link. Replication slots are essentially used to manage replication links. Unlike streaming replication, which can function without replication slots, logical replication requires them.\nOutput Plugin # The output plugin converts WAL log information into the format required by the replication slot. PostgreSQL ships some built-in output plugins, and additional ones can be installed as extensions. Each logical replication slot has an output plugin for WAL-related parsing work.\nOutput plugins use callback functions to manage parsing. For example, the startup callback sets the output type to OUTPUT_PLUGIN_BINARY_OUTPUT or OUTPUT_PLUGIN_TEXTUAL_OUTPUT to choose binary or text output. There are also callbacks that notify the plugin of transaction begin, row changes, and commit. Callback functions do not need to be called manually; the built-in output plugins already implement them.\nEach output plugin has its own parsing behavior and output format.\nSeveral Common Output Plugins # test_decoding: This is a sample output plugin, essentially the raw form of an output plugin. Official documentation says it\u0026rsquo;s a template, but it can still parse. This output plugin comes with PostgreSQL but needs to be compiled in contrib.\npgoutput: The default output plugin for the publish-subscribe model. 
In publish-subscribe, the walsender process uses this output plugin to logically decode WAL logs.\ndecoder_raw: Parses into SQL text format. This is not included with PostgreSQL; compile it yourself: https://github.com/michaelpq/pg_plugins/tree/main/decoder_raw\nwal2json: This output plugin converts WAL log information into JSON format.\nOther output plugins can be referenced at: https://wiki.postgresql.org/wiki/Logical_Decoding_Plugins\nSome domestic vendors have also made their own output plugins.\nRelationship between several output plugins and logical replication plugins:\npgoutput, test_decoding, and wal2json have been introduced above.\npglogical is the extension that provided logical replication from pg9.4 onwards, before native logical replication arrived in pg10.\nBDR was developed by 2ndQuadrant, supporting bidirectional replication and DDL synchronization with more powerful features. BDR 3.0 onwards became closed-source.\nFunctions and Tools for Manually Receiving Parsed Data # pg_logical_slot_get_changes(): Displays parsed data and consumes it.\npg_logical_slot_peek_changes(): Displays parsed data without consuming it.\npg_recvlogical: A tool included with PostgreSQL that can consume data within a replication slot, equivalent to the downstream of logical replication. 
The corresponding physical WAL receiving tool is pg_receivewal.\nLogical Decoding Test 1: Observing data parsing with 2 different output plugins # -- Create two logical replication slots using logical_test and logical_raw respectively lzldb=# select pg_create_logical_replication_slot(\u0026#39;logical_test\u0026#39;,\u0026#39;test_decoding\u0026#39;); pg_create_logical_replication_slot ------------------------------------ (logical_test,0/1756F50) lzldb=# select pg_create_logical_replication_slot(\u0026#39;logical_raw\u0026#39;,\u0026#39;decoder_raw\u0026#39;); pg_create_logical_replication_slot ------------------------------------ (logical_raw,0/1756F88) -- Only the upstream is created, slot is in f state lzldb=# select * from pg_replication_slots; slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size --------------+---------------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+--------------- logical_test | test_decoding | logical | 16385 | lzldb | f | f | | | 558 | 0/1766878 | 0/17668B0 | reserved | logical_raw | decoder_raw | logical | 16385 | lzldb | f | f | | | 557 | 0/1756F50 | 0/1756F88 | reserved | -- Create a table lzldb=# create table tdecoder222(a int,b varchar(10)); CREATE TABLE -- Attempt to get this DDL lzldb=# SELECT * FROM pg_logical_slot_get_changes(\u0026#39;logical_raw\u0026#39;, NULL, NULL, \u0026#39;include-xids\u0026#39;, \u0026#39;0\u0026#39;); ERROR: option \u0026#34;include-xids\u0026#34; = \u0026#34;0\u0026#34; is unknown CONTEXT: slot \u0026#34;logical_raw\u0026#34;, output plugin \u0026#34;decoder_raw\u0026#34;, in the startup callback lzldb=# SELECT * FROM pg_logical_slot_get_changes(\u0026#39;logical_test\u0026#39;, NULL, NULL, \u0026#39;include-xids\u0026#39;, \u0026#39;0\u0026#39;); lsn | xid | data -----------+-----+-------- 
0/17669C8 | 558 | BEGIN 0/1776778 | 558 | COMMIT -- We can see that decoder_raw didn\u0026#39;t parse the DDL at all, and logical_test only got the DDL transaction without the DDL statement itself, essentially not parsing the DDL -- Insert a row lzldb=# insert into tdecoder222 values(1,\u0026#39;lzl\u0026#39;); INSERT 0 1 lzldb=# select * from pg_logical_slot_peek_changes(\u0026#39;logical_test\u0026#39;,null,null); lsn | xid | data -----------+-----+--------------------------------------------------------------------------- 0/1776890 | 560 | BEGIN 560 0/1776890 | 560 | table public.tdecoder222: INSERT: a[integer]:1 b[character varying]:\u0026#39;lzl\u0026#39; 0/1776900 | 560 | COMMIT 560 lzldb=# select * from pg_logical_slot_peek_changes(\u0026#39;logical_raw\u0026#39;,null,null); lsn | xid | data -----------+-----+---------------------------------------------------------- 0/1776890 | 560 | INSERT INTO public.tdecoder222 (a, b) VALUES (1, \u0026#39;lzl\u0026#39;); -- test_decoding parsed the transaction -- decoder_raw parsed the transaction into SQL statements This test allows us to conclude:\nReplication slots in f state still parse, waiting for downstream consumption Each output plugin has some different parsing behaviors and output formats Logical Decoding Test 2: Using pg_recvlogical to receive logically decoded data, simulating a logical replication link # -- Configure passwordless login [pg@lzl ~]$ vi .pgpass [pg@lzl ~]$ cat .pgpass lzl:5410:lzldb:pg:pg [pg@lzl ~]$ chmod 0600 .pgpass -- Start pg_recvlogical [pg@lzl ~]$ pg_recvlogical -h lzl -p 5410 -d lzldb -U pg --slot=logical_raw --start -f recv.sql \u0026amp; [pg@lzl ~]$ ps -ef|grep recv|grep -v grep pg 7747 7355 0 21:40 pts/3 00:00:00 pg_recvlogical -h lzl -p 5410 -d lzldb -U pg --slot=logical_raw --start -f recv.sql lzldb=# insert into tdecoder222 values(2,\u0026#39;qwe\u0026#39;); INSERT 0 1 lzldb=# update tdecoder222 set b=\u0026#39;asd\u0026#39; where a=2; UPDATE 1 [pg@lzl ~]$ tail -2f recv.sql 
INSERT INTO public.tdecoder222 (a, b) VALUES (2, \u0026#39;qwe\u0026#39;); -- update was not correctly parsed -- Add a primary key to the table lzldb=# alter table tdecoder222 add primary key(a); ALTER TABLE lzldb=# insert into tdecoder222 values(100,\u0026#39;lzl1\u0026#39;); INSERT 0 1 lzldb=# insert into tdecoder222 values(200,\u0026#39;lzl2\u0026#39;); INSERT 0 1 lzldb=# update tdecoder222 set b=\u0026#39;lzlupdate\u0026#39; where a=200; UPDATE 1 [pg@lzl ~]$ tail -3f recv.sql INSERT INTO public.tdecoder222 (a, b) VALUES (100, \u0026#39;lzl1\u0026#39;); INSERT INTO public.tdecoder222 (a, b) VALUES (200, \u0026#39;lzl2\u0026#39;); UPDATE public.tdecoder222 SET a = 200, b = \u0026#39;lzlupdate\u0026#39; WHERE a = 200; \u0026ndash; After adding a primary key, update was correctly parsed by decoder_raw \u0026ndash; Without a primary key, it won\u0026rsquo;t be correctly parsed. This is related to replica identity, which will be introduced later.\nPrerequisites for Logical Replication # 1. Parameters # 1.1 Basic Required Parameters\nwal_level. Takes effect after restart, default is replica. The wal_level parameter must be logical. logical does not change WAL to logical; it means that on top of supporting physical replication (replica), the necessary information for logical decoding is added. Since pg9.6, there are only minimal, replica, and logical, with information content increasing successively. max_replication_slots. Takes effect after restart, default value below pg9.6 is 0, pg10 and above is 10. 10 is generally sufficient. Like physical replication, logical replication generally also uses replication slots. PostgreSQL backups and physical replication can both occupy replication slot counts. 1.2 Source-side Required Parameters\nmax_wal_senders. Takes effect after restart, default 10. Sender process count limit. The publisher\u0026rsquo;s sender transmits the parsed logs. Generally, one logical replication slot corresponds to one sender and one worker. 
This is similar to physical replication, where one physical replication slot corresponds to one sender and one receiver. 1.3 Target-side Required Parameters\nmax_worker_processes. Takes effect after restart, default 8. Worker process count limit. Parallel processes (parallel queries, parallel statistics collection, etc., limited by max_parallel_workers), logical replication worker processes (max_logical_replication_workers), and some other programs that need to fork workers are all related to this parameter. It should be set to at least max_parallel_workers + logical replication apply workers + other background workers. max_logical_replication_workers. Takes effect after restart, default 4. Logical replication worker process count, including logical replication apply worker processes and table sync worker processes. max_sync_workers_per_subscription. Takes effect after reload, default 2. Sync worker processes used when adding new tables to logical replication. Currently, each table is copied by a single sync worker; the initial sync of one table is not parallelized. The above three parameters are tiered: max_sync_workers_per_subscription \u0026lt; max_logical_replication_workers \u0026lt; max_worker_processes. In short, there must be workers available. 2. Permissions # Replication user permissions. Logical replication users need replication privileges. ALTER ROLE \u0026lt;username\u0026gt; WITH REPLICATION;\nHBA access restrictions, allowing downstream to access the database using the replication user. host lzldb user1 172.17.100.150/32 md5\nFor the publish-subscribe model, CREATE permission on the database or superuser permission is needed. When creating a publication, for table only, at least the table owner with CREATE permission is needed. All other publications require superuser.\nWhen creating a subscription, superuser is required.\ngrant create on database lzl1db to owner1; or\nalter user replicate1 superuser;\nAdditionally, the roles involved need the corresponding read (publisher side) or write (subscriber side) permissions on the replicated tables. 
Logical Synchronization Between PostgreSQL Instances: Publish and Subscribe # PostgreSQL\u0026rsquo;s built-in logical replication is based on the publish-subscribe model. The publish-subscribe model does not convert changes into SQL text before applying them on the subscriber.\nPublication # A publisher can have multiple publications, and each publication can have multiple tables. When publishing, you can specify: for table: publishes certain tables. New tables need to be explicitly added with ALTER PUBLICATION ... ADD TABLE. At minimum, the table owner is needed to create this publication.\nfor all tables: publishes all tables under the database. New tables are automatically published. Superuser is required to create this publication.\nfor tables in schema: publishes all tables under the schema. New tables are automatically published. Superuser is required to create this publication. Supported starting from pg15.\nPublications by default include INSERT, UPDATE, DELETE, and TRUNCATE. You can also specify to replicate only certain commands. DDL is not synchronized. (Official documentation verbatim. This means truncate is not considered DDL in PostgreSQL; leaving this as a topic for later research. Truncate is DDL in MySQL and Oracle.) Only base tables can be published; temporary tables, foreign tables, views, sequences, etc. cannot be published. Partitioned table publishing is related to PostgreSQL version and partition attributes. pg15 defaults to publishing all partitions of a partitioned table. publish_via_partition_root. Supported from pg13. This publication parameter indicates whether changes to a partitioned table are published using the identity and schema of the individual partitions (false, default) or of the root partitioned table (true). If set to true, heterogeneous partitioned table logical replication is supported, such as partitioned table to regular table replication. When true, TRUNCATE executed directly on individual partitions is not replicated. Subscription # A subscription has only one publisher but can subscribe to multiple publications on the publisher. 
A subscriber can have multiple subscriptions, each receiving data from one replication slot. One subscription corresponds to one replication slot, which is on the publisher side. When creating or deleting a subscription, the replication slot is automatically created or deleted on the publisher by default. Creating a subscription requires superuser. DDL is not synchronized; tables must already be created. Existing data is synchronized by default, via COPY snapshot to the subscriber. Synchronization can be paused and resumed with ALTER SUBSCRIPTION sub1 {ENABLE|DISABLE}. When a publication adds new tables, refresh is needed on the subscriber side: alter subscription sub1 refresh publication. Schema names, table names, and column names must be consistent between publication and subscription. Column types can differ (as long as implicit conversion succeeds). Column order can be different. Subscriptions also have some attributes, such as binary transfer, streaming, synchronous commit, two-phase commit, etc. logical replication launcher is used to start the subscriber-side worker processes and only exists at startup. /*------------------------------------------------------------------------- ... * IDENTIFICATION * src/backend/replication/logical/launcher.c ... * NOTES * This module contains the logical replication worker launcher which * uses the background worker infrastructure to start the logical * replication workers for every enabled subscription. *------------------------------------------------------------------------- */ Publish-Subscribe Related Views # pg_publication; \u0026ndash; View publications. 
Publications themselves are stateless; replication slots are stateful, so there\u0026rsquo;s no pg_stat_publication.\npg_publication_tables \u0026ndash; View published tables, simple and clear.\npg_publication_rel \u0026ndash; View published tables, shown as OIDs only.\npg_stat_subscription \u0026ndash; View subscription status, pid is the worker process pid.\npg_subscription \u0026ndash; View subscriptions.\npg_subscription_rel \u0026ndash; View subscription tables. There\u0026rsquo;s no pg_subscription_tables. Additionally, this view can show the sync status of individual tables under a subscription, which the replication slot view cannot do.\n\\dRp list replication publications\n\\dRs list replication subscriptions\nCreating a Publication and Subscription # Using a dedicated replication user replicate1, create a publication in the database lzldb and a subscription in the database lzlbd to implement logical replication of table trep1.\nRole Host IP Port Database Schema Table Replication User Version Publisher 172.17.100.150 5410 lzldb public trep1 replicate1 pg13 Subscriber 172.17.100.150 5412 lzlbd public trep1 replicate1 pg13 Creating the Publication # # Modify postgresql.conf, wal_level parameter takes effect after restart wal_level=logical # Modify pg_hba.conf file, takes effect after reload host lzldb replicate1 172.17.100.150/32 md5 -- Create replication user and grant privileges create user replicate1 with password \u0026#39;replicate1\u0026#39;; alter user replicate1 with replication; grant create on database lzldb to replicate1; -- Create the table to be replicated and grant privileges to the replication user \\c lzldb replicate1 -- If the replication user is not the table owner, should grant select on trep1 to replicate1 create table trep1(a int primary key,b char(10)); insert into trep1 values(1,\u0026#39;abc\u0026#39;); -- Create publication, superuser can also be used \\c lzldb replicate1 create publication pub_lzl1 for table trep1; -- View publication. 
\\dRp or pg_publication lzldb=# select * from pg_publication; oid | pubname | pubowner | puballtables | pubinsert | pubupdate | pubdelete | pubtruncate | pubviaroot -------|----------|----------|--------------|-----------|-----------|-----------|-------------|----------- 16400 | pub_lzl1 | 16392 | f | t | t | t | t | f Creating the Subscription # -- Create table definition create table trep1(a int primary key,b char(10)); -- Use superuser to create subscription CREATE SUBSCRIPTION sub_test CONNECTION \u0026#39;host=172.17.100.150 port=5410 dbname=lzldb user=replicate1 password=replicate1\u0026#39; PUBLICATION pub_lzl1; lzlbd=# select * from pg_subscription; -- View subscription. \\dRs or pg_subscription oid | subdbid | subname | subowner | subenabled | subconninfo | subslotname | subsynccommit | subpublications -------|---------|----------|----------|------------|--------------------------------------------------------------------------------|-------------|---------------+----------------- 16394 | 16384 | sub_test | 10 | t | host=172.17.100.150 port=5410 dbname=lzldb user=replicate1 password=replicate1 | sub_test | off | {pub_lzl1} lzlbd=# select * from trep1; -- Verify existing data has been synchronized a | b ---+------------ 1 | abc Publish-Subscribe Model Test 1: Truncate Synchronization # lzldb=# truncate table trep1; TRUNCATE TABLE lzldb=# select * from trep1; a | b ---+--- (0 rows) lzlbd=# select * from trep1; -- In publish-subscribe mode, truncate is synchronized a | b ---+--- (0 rows) Publish-Subscribe Model Test 2: Adding New Table Synchronization # -- Under an existing publish-subscribe, add a new table synchronization. lzldb is publisher, lzlbd is subscriber lzldb=# create table tab_pk(a int,b varchar(10)); CREATE TABLE lzldb=# alter table tab_pk add primary key(a); ALTER TABLE lzldb=# alter publication pub_lzl1 add table tab_pk; ALTER PUBLICATION -- After adding a table on the publisher, refresh must be executed on the subscriber. 
Refresh defaults to synchronizing existing data lzlbd=# alter subscription sub_test refresh publication; ALTER SUBSCRIPTION lzlbd=# select * from pg_subscription_rel ; srsubid | srrelid | srsubstate | srsublsn ---------+---------+------------+----------- 16394 | 16389 | r | 0/15F2898 16394 | 16400 | d | -- Subscription state codes: i = initializing, d = copying data, s = synchronized, r = ready (normal replication) -- At this point, table tab_pk data has not been synchronized because the replication user lacks query permission on the table on the publisher lzldb=# grant select on tab_pk to replicate1; GRANT lzlbd=# select * from pg_subscription_rel ; srsubid | srrelid | srsubstate | srsublsn ---------+---------+------------+----------- 16394 | 16389 | r | 0/15F2898 16394 | 16400 | r | 0/172D830 -- Subscription is in ready state, new table synchronization complete Replica Identity # Replica identity is written into WAL logs to identify a row of data. Whether it\u0026rsquo;s publish-subscribe or a third-party logical sync tool, the downstream needs to locate the row that an update or delete affects.\nPostgreSQL supports 4 replica identity modes.\ndefault(d): Default identity for non-system tables. Uses the primary key if the table has one; without a primary key it behaves like nothing. index(i): Uses a non-null unique index as the identity. The index must be both non-null and unique to identify a row; a unique index alone still allows multiple null values. You can also explicitly specify the primary key index in index mode. full(f): Uses all columns of the row as the identity. Full mode increases WAL log volume. nothing(n): Default mode for system tables. No identity; update and delete cannot be replicated downstream. 
-- View table\u0026#39;s replica identity: select relname,relreplident from pg_class where relname=\u0026#39;tabname1\u0026#39;; -- When a table\u0026#39;s replica identity is i, check if the index is the replica identity: \\d tabname select rel.relname,idx.indisreplident from pg_index idx ,pg_class rel where idx.indexrelid=rel.oid and relname=\u0026#39;idx_1\u0026#39;; Modify table replica identity:\nALTER TABLE tab1 REPLICA IDENTITY { DEFAULT | USING INDEX index_name | FULL | NOTHING }; Replica Identity Test 1: Setting a non-null unique index as replica identity for a table without a primary key # lzldb=# create table tab_idx(a int,b varchar(10)); CREATE TABLE lzldb=# select relname,relreplident from pg_class where relname=\u0026#39;tab_idx\u0026#39;; relname | relreplident ---------+-------------- tab_idx | d lzldb=# create unique index idx_1 on tab_idx(b); CREATE INDEX lzldb=# alter table tab_idx alter b set not null; -- The index used as replica identity must be a non-null unique index ALTER TABLE lzldb=# select rel.relname,idx.indisreplident from pg_index idx ,pg_class rel where idx.indexrelid=rel.oid and relname=\u0026#39;idx_1\u0026#39;; relname | indisreplident ---------+---------------- idx_1 | f lzldb=# alter table tab_idx REPLICA IDENTITY using index idx_1; -- Modify table\u0026#39;s replica identity ALTER TABLE lzldb=# select rel.relname,idx.indisreplident from pg_index idx ,pg_class rel where idx.indexrelid=rel.oid and relname=\u0026#39;idx_1\u0026#39;; relname | indisreplident ---------+---------------- idx_1 | t lzldb=# \\d tab_idx -- pg_index or \\d to view index replica identity. 
\\d can only display explicitly modified index replica identity Table \u0026#34;public.tab_idx\u0026#34; Column | Type | Collation | Nullable | Default --------+-----------------------+-----------+----------+--------- a | integer | | | b | character varying(10) | | not null | Indexes: \u0026#34;idx_1\u0026#34; UNIQUE, btree (b) REPLICA IDENTITY Replica Identity Test 2: Full mode: can duplicate rows be synchronized normally? # -- Execute the following on the publisher lzldb=# create table tab_full (a int,b varchar(10)); -- Add a table without a primary key or non-null unique index to the sync CREATE TABLE lzldb=# insert into tab_full values(1,\u0026#39;abc\u0026#39;); -- Repeat this insert 5 times to create 5 identical rows INSERT 0 1 lzldb=# grant select on tab_full to replicate1; GRANT lzldb=# alter publication pub_lzl1 add table tab_full; ALTER PUBLICATION -- On the subscriber lzlbd=# alter subscription sub_test refresh publication; ALTER SUBSCRIPTION lzlbd=# select ctid,* from tab_full ; ctid | a | b -------+---+----- (0,1) | 1 | abc (0,2) | 1 | abc (0,3) | 1 | abc (0,4) | 1 | abc (0,5) | 1 | abc lzldb=# delete from tab_full where ctid=\u0026#39;(0,2)\u0026#39;; ERROR: cannot delete from table \u0026#34;tab_full\u0026#34; because it does not have a replica identity and publishes deletes HINT: To enable deleting from the table, set REPLICA IDENTITY using ALTER TABLE. lzldb=# update tab_full set a=2 where ctid=\u0026#39;(0,5)\u0026#39;; ERROR: cannot update table \u0026#34;tab_full\u0026#34; because it does not have a replica identity and publishes updates HINT: To enable updating the table, set REPLICA IDENTITY using ALTER TABLE. -- When the table\u0026#39;s replica identity is d(default), without a primary key it\u0026#39;s nothing. nothing cannot replicate delete and update. 
lzldb=# alter table tab_full replica identity full; ALTER TABLE lzldb=# delete from tab_full where ctid=\u0026#39;(0,2)\u0026#39;; -- After setting replica identity to full, delete succeeds DELETE 1 lzlbd=# select ctid,* from tab_full ; -- ctid | a | b -------+---+----- (0,2) | 1 | abc (0,3) | 1 | abc (0,4) | 1 | abc (0,5) | 1 | abc lzldb=# update tab_full set a=2 where ctid=\u0026#39;(0,5)\u0026#39;; UPDATE 1 lzldb=# select ctid,* from tab_full; ctid | a | b -------+---+----- (0,1) | 1 | abc (0,3) | 1 | abc (0,4) | 1 | abc (0,6) | 2 | abc lzlbd=# select ctid,* from tab_full ; ctid | a | b -------+---+----- (0,3) | 1 | abc (0,4) | 1 | abc (0,5) | 1 | abc (0,6) | 2 | abc \u0026ndash; This example proves 3 points: \u0026ndash; 1. When replica identity is d(default), it defaults to primary key; if no primary key, it\u0026rsquo;s nothing. \u0026ndash; 2. nothing cannot replicate delete and update. \u0026ndash; 3. Duplicate data in full mode can also be normally logically replicated. Although the ctid of data rows differs, the replication goal is still achieved.\nThird-Party Synchronization Software # Third-party synchronization software already has relatively mature solutions and is widely used, such as OGG, DTS, KTL, etc.\nThese sync tools are very flexible. They can achieve true heterogeneous synchronization, from PostgreSQL databases to different databases or Kafka, big data consumption platforms, etc.\nOf course, they can also sync from other architecture data platforms to PostgreSQL databases, such as the now common Oracle to PostgreSQL sync scenario.\nSince we\u0026rsquo;re mainly discussing the PostgreSQL database itself, when PostgreSQL acts as the downstream target, it\u0026rsquo;s just some data write issues with very few problems. There won\u0026rsquo;t be logical decoding, replication slot issues, etc. So this small section won\u0026rsquo;t discuss PostgreSQL as a heterogeneous sync target. 
We\u0026rsquo;ll only observe and summarize scenarios where PostgreSQL acts as the upstream syncing to heterogeneous databases. These third-party tools generally utilize PostgreSQL\u0026rsquo;s own logical decoding, specify their own output plugin, and automatically create replication slots and replication links. Some tools automatically create subscriptions, while others only have replication slots without subscriptions.\nHaving already understood logical decoding, output plugins, replication slots, replica identity, and prerequisites for replication, let\u0026rsquo;s simulate a PostgreSQL to Oracle sync by directly configuring the prerequisites and starting synchronization.\nCreating OGG Sync from PostgreSQL to Oracle # Software Installation:\nogg for oracle: Oracle GoldenGate 21.3.0.0.0 for Oracle on Linux x86-64\nogg for pg: Oracle GoldenGate 21.3.0.0.0 for PostgreSQL on Linux x86-64\noracle: 11.2.0.4\npg: 13.10\nInstallation steps:\nOGG installation and deployment won\u0026rsquo;t be introduced here. I followed the article\u0026rsquo;s installation steps step by step. 
Installation article reference: https://liuzhilong.blog.csdn.net/article/details/129252320?spm=1001.2014.3001.5502\nSync architecture diagram:\nlzldb=# select * from pg_replication_slots where slot_name=\u0026#39;ext_pg_5d4b1d39f7494f79\u0026#39;; -[ RECORD 1 ]-------+------------------------ slot_name | ext_pg_5d4b1d39f7494f79 plugin | test_decoding -- OGG defaults to using test-decoding slot_type | logical datoid | 16385 database | lzldb temporary | f active | t -- As long as OGG extract is running, the replication slot is active active_pid | 3509 xmin | catalog_xmin | 591 restart_lsn | 0/17F3E38 confirmed_flush_lsn | 0/17F4020 wal_status | reserved safe_wal_size | select * from pg_stat_replication -[ RECORD 2 ]----+------------------------------ pid | 3509 usesysid | 10 usename | pg application_name |GoldenGateCapture client_addr | 127.0.0.1 client_hostname | client_port | 43665 backend_start | 2023-02-28 15:12:17.350469+08 backend_xmin | state | streaming sent_lsn | 0/17F4140 write_lsn | 0/17F4020 flush_lsn | 0/17F4020 replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | 0 sync_state | async reply_time | 2023-02-28 16:39:44.986625+08 -- replay_lsn has no value -- Even lag has no value Logical Replication Monitoring # An important method for logical replication lag monitoring is checking lag from the replication software. Without that, you can only check from the replication slot view. The replication slot view provides quite a lot of information, such as whether the replication slot is active directly indicating whether the replication link is syncing.\nThe replication slot view is very important for logical replication monitoring. Some additional monitoring for publish-subscribe was introduced earlier. Here we focus on broader logical replication monitoring.\npg_replication_slots # The replication slot view shows information about each replication slot and some slot statuses. 
Manually created slots or slots automatically created by tools and subscriptions are all displayed here.\nslot_name Replication slot name plugin Output plugin name for logical replication slots. If empty, it\u0026rsquo;s a physical replication slot slot_type physical or logical datoid Database ID for logical replication slot database Database for logical replication slot temporary Whether it\u0026rsquo;s a temporary replication slot. Temporary slots are not written to disk and are automatically deleted when the session ends. pg_basebackup uses temporary slots by default active Replication slot status: t or f. If f, you should quickly consider restarting the replication link or deleting it, as it may block WAL log deletion and fill up the primary database disk. This is related to the max_slot_wal_keep_size parameter active_pid walsender PID using this replication slot. Only present when the slot status is t xmin Minimum transaction ID the slot needs to hold catalog_xmin Minimum catalog transaction ID the slot needs to hold restart_lsn LSN position of WAL the slot needs to retain to ensure downstream consumer\u0026rsquo;s required WAL won\u0026rsquo;t be cleaned. max_slot_wal_keep_size parameter is the maximum WAL size the slot needs to retain. Beyond this value, WAL can also be deleted. Default -1 means never cleaned. This value represents the LSN position after the downstream\u0026rsquo;s latest checkpoint consumption and can help locate replication link lag confirmed_flush_lsn LSN confirmed received by the logical replication downstream. Empty for physical replication slots wal_status Status of WAL claimed by this replication slot reserved: the slot reserves WAL, WAL hasn\u0026rsquo;t exceeded max_wal_size (auto-checkpoint interval) extended: the slot reserves WAL, WAL has exceeded max_wal_size but the slot still retains it. 
WAL in this state is still within wal_keep_size or max_slot_wal_keep_size unreserved: the slot no longer retains needed WAL, WAL will be deleted at next checkpoint lost: WAL needed by the slot has been cleaned, slot is invalid. The last two states are seen only when max_slot_wal_keep_size is non-negative. This is easy to understand, since max_slot_wal_keep_size is the criterion for whether WAL can be deleted. Without a mechanism to delete slot WAL, unreserved and lost states wouldn't appear. If restart_lsn is NULL, this field is null. Also easy to understand — if there's no WAL LSN, you can't know the WAL retention position or judge whether WAL has exceeded wal_keep_size or max_slot_wal_keep_size. safe_wal_size Number of WAL bytes that can be written before WAL files would be deleted. If this value is negative or zero, it means max_slot_wal_keep_size has been exceeded, and WAL files will be deleted as soon as a checkpoint occurs, requiring the standby using this slot to be rebuilt pg_stat_replication # Rather than replication status, it\u0026rsquo;s more accurate to call it walsender status. This view shows the status of each walsender, one record per walsender.\nIf present in pg_replication_slots but not in pg_stat_replication, the walsender is gone; logical replication is down; pg_replication_slots active should be f. If absent in pg_replication_slots but present in pg_stat_replication, this is physical replication without a replication slot. You can have replication stat info without a replication slot. 
Replication slots with walsenders also need this view because it reveals more replication status info than pg_replication_slots.\nSo when the replication slot hasn\u0026rsquo;t failed, pg_stat_replication is very important for monitoring logical replication lag.\npid walsender PID, same as pg_replication_slots active_pid usesysid User OID connected to this walsender, i.e., the downstream\u0026rsquo;s replication user OID usename Username connected to this walsender application_name Downstream application name. If subscription, it\u0026rsquo;s the subscription name. If pg_recvlogical, it\u0026rsquo;s pg_recvlogical client_addr Downstream IP. If empty, it\u0026rsquo;s a local socket connection client_hostname Downstream hostname client_port Downstream port. If -1, it\u0026rsquo;s a local socket connection backend_start Backend start time, i.e., when downstream connected to walsender backend_xmin Standby\u0026rsquo;s xmin when hot_standby_feedback is enabled. This is clearly for physical replication state Walsender state. startup: walsender starting. catchup: walsender catching up with primary logs. streaming: walsender has caught up with primary logs, normal replication state. backup: walsender sending backup, this state appears for walsender used for backup. stopping: walsender stopping sent_lsn LSN sent write_lsn LSN written to disk by downstream flush_lsn LSN flushed to disk by downstream replay_lsn LSN replayed by downstream write_lag Time elapsed between the primary flushing WAL locally and the downstream reporting it written flush_lag Time elapsed between the primary flushing WAL locally and the downstream reporting it flushed replay_lag Time elapsed between the primary flushing WAL locally and the downstream reporting it replayed sync_priority Synchronization priority sync_state Synchronization state reply_time Last reply time Relationship between sent_lsn, write_lsn, flush_lsn, replay_lsn # The above nicely shows the hierarchical relationship of sent_lsn, write_lsn, flush_lsn.\nThese monitoring metrics look very much like those of streaming replication. 
For logical replication, sent_lsn, write_lsn, flush_lsn also generally have values.\nHowever, when the downstream is not a PostgreSQL instance, there may be no replay step at all, so logical replication may not report replay_lsn.\nThe one value that is always meaningful is sent_lsn.\nAfter reviewing pg_replication_slots and pg_stat_replication, we find that neither shows decoding delay; at most, you can see log transmission delay.\npg_stat_replication_slots # This view has been available since pg14. It specifically monitors logical replication slot status and can additionally monitor spill status. On pg13 and earlier, you can only check the pg_replslot directory. Spill will be introduced later.\nLogical Replication Slot Transaction Snapshots and pg_logical Directory # The transaction snapshots needed by replication slots are persisted to disk. The source code is in snapbuild.c.\nvoid SnapBuildSerializationPoint(SnapBuild *builder, XLogRecPtr lsn) { if (builder-\u0026gt;state \u0026lt; SNAPBUILD_CONSISTENT) SnapBuildRestore(builder, lsn); else SnapBuildSerialize(builder, lsn); } Snapshot persistence has two behaviors: restore, which loads a snapshot from disk into memory, and serialize, which persists it from memory to disk.\nTransaction snapshot persistence:\nSnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn) ... sprintf(path, \u0026#34;pg_logical/snapshots/%X-%X.snap\u0026#34;, (uint32) (lsn \u0026gt;\u0026gt; 32), (uint32) lsn); ... else if (ret == 0) { /* * somebody else has already serialized to this point, don\u0026#39;t overwrite * but remember location, so we don\u0026#39;t need to read old data again. * * To be sure it has been synced to disk after the rename() from the * tempfile filename to the real filename, we just repeat the fsync. * That ought to be cheap because in most scenarios it should already * be safely on disk. 
*/ fsync_fname(path, false); fsync_fname(\u0026#34;pg_logical/snapshots\u0026#34;, true); builder-\u0026gt;last_serialized_snapshot = lsn; goto out; } Transaction snapshot loading into memory:\nSnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn) ... if (builder-\u0026gt;state == SNAPBUILD_CONSISTENT) return false; sprintf(path, \u0026#34;pg_logical/snapshots/%X-%X.snap\u0026#34;, (uint32) (lsn \u0026gt;\u0026gt; 32), (uint32) lsn); fd = OpenTransientFile(path, O_RDONLY | PG_BINARY); The transactions needed by logical replication slots, before being committed, store dirty transaction data and unconsumed data under pg_logical/snapshots/. After committing data or starting the replication slot, data is handed to reorderbuffer; or after cleaning the replication slot, the data is released.\nMy environment has a long-unused slot with restart_lsn at 0/1776858:\npostgres=# select slot_name,plugin,slot_type,database,active,restart_lsn from pg_replication_slots where slot_name=\u0026#39;logical_test\u0026#39;; slot_name | plugin | slot_type | database | active | restart_lsn --------------+---------------+-----------+----------+--------+------------- logical_test | test_decoding | logical | lzldb | f | 0/1776858 The oldest snapshot under pg_logical/snapshots/ is it:\n[pg@lzl snapshots]$ ll total 300 -rw------- 1 pg pg 144 Feb 23 20:41 0-1776858.snap -rw------- 1 pg pg 144 Feb 23 20:44 0-1776900.snap -rw------- 1 pg pg 144 Feb 23 20:45 0-1776938.snap Delete unwanted replication slot:\nselect pg_drop_replication_slot(\u0026#39;logical_test\u0026#39;); After a few minutes, snap is deleted:\n[pg@lzl snapshots]$ ll 0-1776858.snap ls: cannot access 0-1776858.snap: No such file or directory Logical Decoding Working Memory and Spill to pg_replslot # logical_decoding_work_mem # Before pg13, logical decoding would retain at most 4096 changes in memory (max_changes_in_memory hardcoded). 
Beyond 4096 changes, transaction data would be written to disk.\npg13 introduced the logical_decoding_work_mem parameter, which sets the working memory available to logical decoding. Each replication connection (walsender) uses its own buffer of this size; it is not a memory area shared across walsenders. If the data held by logical decoding exceeds this value, it\u0026rsquo;s written to disk. The default is 64MB.\nRelated ReorderBuffer and Spill # Description in reorderbuffer.c:\n* This module gets handed individual pieces of transactions in the order * toplevel transaction sized pieces. When a transaction is completely * reassembled - signaled by reading the transaction commit record - it * will then call the output plugin (cf. ReorderBufferCommit()) with the * individual changes. The output plugins rely on snapshots built by * snapbuild.c which hands them to us. In other words, the reorder buffer receives individual changes, reassembles them per transaction, and once the commit record is read, passes the changes to the output plugin. The output plugin relies on snapshots built by snapbuild.c, which are handed to reorderbuffer.\n/* * Maximum number of changes kept in memory, per transaction. After that, * changes are spooled to disk. * * The current value should be sufficient to decode the entire transaction * without hitting disk in OLTP workloads, while starting to spool to disk in * other workloads reasonably fast. * * At some point in the future it probably makes sense to have a more elaborate * resource management here, but it\u0026#39;s not entirely clear what that would look * like. */ int logical_decoding_work_mem; static const Size max_changes_in_memory = 4096; /* XXX for restore only */ When decoded data exceeds logical_decoding_work_mem, it\u0026rsquo;s written to disk. max_changes_in_memory is still hardcoded at 4096, but it is now only used to bound how many changes are restored from disk at a time. 
In pg12 source, there\u0026rsquo;s no int logical_decoding_work_mem, and subsequent serialization was also judged based on max_changes_in_memory.\nIn pg13, Disk serialization source code starts from line 2333. When parsed data in memory exceeds logical_decoding_work_mem, the largest transaction is spilled to disk. ReorderBufferLargestTXN(rb) finds the largest transaction. ReorderBufferSerializeTXN(rb, txn) persists this transaction. The immediately following code is ReorderBufferSerializeTXN():\n/* * Spill data of a large transaction (and its subtransactions) to disk. */ static void ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) { dlist_iter subtxn_i; dlist_mutable_iter change_i; int fd = -1; XLogSegNo curOpenSegNo = 0; Size spilled = 0; elog(DEBUG2, \u0026#34;spill %u changes in XID %u to disk\u0026#34;, (uint32) txn-\u0026gt;nentries_mem, txn-\u0026gt;xid); /* do the same to all child TXs */ ... At debug2 level, spill logs are output:\n/* * Given a replication slot, transaction ID and segment number, fill in the * corresponding spill file into \u0026#39;path\u0026#39;, which is a caller-owned buffer of size * at least MAXPGPATH. */ static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot, TransactionId xid, XLogSegNo segno) { XLogRecPtr recptr; XLogSegNoOffsetToRecPtr(segno, 0, wal_segment_size, recptr); snprintf(path, MAXPGPATH, \u0026#34;pg_replslot/%s/xid-%u-lsn-%X-%X.spill\u0026#34;, NameStr(MyReplicationSlot-\u0026gt;data.name), xid, (uint32) (recptr \u0026gt;\u0026gt; 32), (uint32) recptr); } Persisted to pg_replslot/replication_slot_name/xid-%u-lsn-%X-%X.spill.\nSimilarly, besides serialize, there\u0026rsquo;s also restore:\n/* * Restore a number of changes spilled to disk back into memory. */ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn, TXNEntryFile *file, XLogSegNo *segno) { Size restored = 0; XLogSegNo last_segno; ... 
while (restored \u0026lt; max_changes_in_memory \u0026amp;\u0026amp; *segno \u0026lt;= last_segno) { int readBytes; ReorderBufferDiskChange *ondisk; ... /* * Read the statically sized part of a change which has information * about the total size. If we couldn\u0026#39;t read a record, we\u0026#39;re at the * end of this file. */ ReorderBufferSerializeReserve(rb, sizeof(ReorderBufferDiskChange)); readBytes = FileRead(file-\u0026gt;vfd, rb-\u0026gt;outbuf, sizeof(ReorderBufferDiskChange), file-\u0026gt;curOffset, WAIT_EVENT_REORDER_BUFFER_READ); ... /* * ok, read a full change from disk, now restore it into proper * in-memory format */ ReorderBufferRestoreChange(rb, txn, rb-\u0026gt;outbuf); restored++; } return restored; } ReorderBufferRestoreChanges() only evaluates the loop condition and increments the counter (restored++); the actual work is done by ReorderBufferRestoreChange():\nstatic void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn, char *data) { ... /* * Update memory accounting for the restored change. We need to do this * although we don\u0026#39;t check the memory limit when restoring the changes in * this branch (we only do that when initially queueing the changes after * decoding), because we will release the changes later, and that will * update the accounting too (subtracting the size from the counters). And * we don\u0026#39;t want to underflow there. */ ReorderBufferChangeMemoryUpdate(rb, change, true, ReorderBufferChangeSize(change)); } In ReorderBufferRestoreChanges(), the while condition is restored \u0026lt; max_changes_in_memory and restored starts at 0, so each call restores at most 4096 changes (fewer if the spill files run out first). The comment in ReorderBufferRestoreChange explains that the memory limit is not checked during restore, yet memory accounting must still be updated so the counters do not underflow when the changes are released later. In other words, changes that were just restored are not immediately spilled again. 
(It feels a bit odd: judging by the memory limit would seem better than hardcoding the restore loop count.)\nInterpreting the logical decoding process based on source code:\nThe transaction snapshot (snap) preserves the metadata needed for decoding. When the replication slot is inactive or the transaction is uncommitted, the snap persists to pg_logical/snapshots/%restart_lsn.snap. After the replication slot restarts or the transaction commits, the snap metadata on disk is read into memory and sent to reorderbuffer for WAL parsing, sorted by transaction start order. If logical decoding data fills up the logical_decoding_work_mem limit, the largest transaction\u0026rsquo;s change entries are persisted to pg_replslot/slot_name/xid-%u-lsn-%X-%X.spill, the other in-memory transactions are sent to the output plugin for format conversion, and finally the decoded information is sent to the downstream.\nIn fact, we can see that long transactions and large transactions can make the entire logical replication link very slow. Large transactions are preferentially spilled to disk, then loaded back from disk to memory after the transaction completes.\nSummary # Logical replication is managed through replication slots: one replication slot, one walsender process, one output plugin. The output plugin determines the output form of logically decoded data, specified when creating the replication slot. Replica identity priority recommendation: primary key -\u0026gt; non-null unique index -\u0026gt; full. The publish-subscribe model is PostgreSQL\u0026rsquo;s built-in logical replication, using pgoutput by default. Publications can be used independently. The publisher process is walsender, and the subscriber process is worker. Pay attention to their respective process parameters. There are many third-party logical replication tools; they generally use PostgreSQL\u0026rsquo;s logical decoding system. For monitoring replication links, pay attention to pg_replication_slots and pg_stat_replication. 
The pg_logical directory stores transaction parsing metadata snaps, waiting for transaction commit before parsing. The pg_replslot directory stores transaction information exceeding logical_decoding_work_mem, called spill. References # Book: 《PostgreSQL实战》\nOfficial Documentation:\nPostgreSQL: Documentation: 15: Chapter 49. Logical Decoding\nPostgreSQL: Documentation: 15: 49.1. Logical Decoding Examples\nPostgreSQL: Documentation: 15: pg_recvlogical\nPostgreSQL: Documentation: 14: 52.81. pg_replication_slots\nPostgreSQL: Documentation: 13: 19.6. Replication\nPostgreSQL: Documentation: 13: 48.6. Logical Decoding Output Plugins\nPostgreSQL: Documentation: 15: 31.1. Publication\nPostgreSQL: Documentation: 15: 31.2. Subscription\nPostgreSQL: Documentation: 15: CREATE PUBLICATION\nHighly Recommended:\nhttps://www.pgconf.asia/JA/2017/wp-content/uploads/sites/2/2017/12/D2-A7-EN.pdf\nLogical replication internals | Select * from Adrien\nAn Overview of Logical Replication in PostgreSQL - Highgo Software Inc.\nDiscussing Logical Decoding from Real Cases\nLong-Troubling Logical Decoding Anomalies\nMonitoring replication: pg_stat_replication - CYBERTEC\nOther References:\nhttps://zhuanlan.zhihu.com/p/311496301\nA Guide to PostgreSQL Change Data Capture - DZone\nChange data capture in Postgres: How to use logical decoding and wal2json - Microsoft Community Hub\nPgSQL · The Secrets of PostgreSQL Logical Streaming Replication Technology · Database Kernel Monthly · KanCloud\nAnalyzing PostgreSQL Logical Replication Principles - CSDN Blog\nhttp://pigsty.cc/zh/blog/2021/03/03/postgres逻辑复制详解/\nLogical replication and logical decoding - Azure Database for PostgreSQL - Flexible Server | Microsoft Learn\n","date":"Aug 13, 2024","externalUrl":null,"permalink":"/en/2024/08/13/postgresql-logical-replication/","section":"Posts","summary":"What is Logical Replication # PostgreSQL logical replication is based on logical decoding, which parses WAL log streams into a specified format for 
output. The subscriber node receives the parsed data and applies it.\nLogical replication differs from streaming replication (physical replication) which is based on instance-level primary-standby where the physical structures are identical. Logical replication can selectively replicate at the table level. Logical Replication in official documentation specifically refers to the “publish-subscribe” model. In fact, many tools can use logical decoding for heterogeneous database data synchronization.\n","title":"PostgreSQL Logical Replication","type":"posts"},{"content":" What is PostgreSQL Streaming Replication? # Streaming Replication is a method for transmitting WAL logs introduced in PostgreSQL 9.0. As soon as the primary database generates a log, it is immediately passed to the standby database. Before PostgreSQL 9.0, PostgreSQL could only transfer WAL logs one at a time (log shipping), and the standby database lagged behind the primary by at least one WAL log. PostgreSQL Streaming Replication Processes # wal sender: The wal sender exists on the primary database. The wal sender process transmits the WAL between the primary\u0026rsquo;s latest LSN and the standby\u0026rsquo;s latest LSN to the standby. wal receiver: The wal receiver exists on the standby database. The wal receiver process transmits the standby\u0026rsquo;s latest LSN to the primary. The wal receiver receives WAL data passed by the wal sender and writes it to WAL logs. startup: The standby instance recovery process. It replays WAL logs on the standby database.\npg 16776 14632 0 13:33 ? 00:00:00 postgres: wal sender process lzl 172.17.100.150(13338) streaming 0/3002D30 pg 16775 15329 0 13:33 ? 00:00:00 postgres: wal receiver process streaming 0/3002D30 pg 15330 15329 0 10:26 ? 
00:00:00 postgres: startup process recovering 000000010000000000000003 PostgreSQL Streaming Replication Principles # PostgreSQL streaming replication is primarily divided into two phases: the instance recovery phase and the primary-standby synchronization phase. Instance Recovery Phase: When a PostgreSQL database crashes abnormally, upon startup, PostgreSQL replays all WAL logs after the last checkpoint before the crash (this is the same principle as instance recovery in Oracle, MySQL, and other relational databases — the goal is to bring the database to a consistent state). When setting up a PostgreSQL standby database, the primary is generally not shut down. At this point, the backup taken from the primary is in an inconsistent state, and the startup process performs instance recovery when the standby starts. Primary-Standby Synchronization Phase: The wal receiver process transmits the standby\u0026rsquo;s latest LSN to the primary. The wal sender transmits the WAL between the primary\u0026rsquo;s latest LSN and the standby\u0026rsquo;s latest LSN to the wal receiver. The wal receiver receives the WAL and writes it to disk, and the startup process replays the WAL logs on the standby.\nSynchronous and Asynchronous # PostgreSQL primary-standby has 5 modes, controlled by the synchronous_commit parameter. The essence of the synchronous_commit parameter is to control when the primary commits. remote_apply: The primary commits only after all standby databases have applied the WAL. This mode is synchronous — the primary and standby are consistent. Data that can be queried on the primary can definitely also be queried on the standby. 
In this mode there is no primary-standby lag, but primary commits take longer because each commit waits for network transmission and for the standby to apply the WAL.\nThe meaning of synchronous_commit depends on whether synchronous standbys are configured (synchronous_standby_names non-empty or empty):\nWhen synchronous_standby_names is non-empty: remote_apply: The primary commits only after the standby has applied the WAL. In this mode the primary and standby are synchronous. on: default. The primary commits when both primary and standby WAL have been flushed to disk. Similar to semi-synchronous, no data will be lost. remote_write: The primary commits when the standby has received the WAL and written it to the filesystem cache, but not yet flushed it to disk. If the standby\u0026rsquo;s OS crashes, that WAL can be lost on the standby. local: The primary commits when its own WAL is flushed to disk. This mode is asynchronous: the primary doesn\u0026rsquo;t need to confirm the standby\u0026rsquo;s status before committing. off: The primary can commit without its own WAL being flushed to disk. There is a risk of data loss. Not recommended.\nWhen synchronous_standby_names is empty: (When synchronous_standby_names is empty, only on and off are effective for synchronous_commit. If set to remote_apply, remote_write, or local, they are still treated as on.) on: default. The database WAL must be flushed to disk before a transaction can commit. off: The primary can commit without its own WAL being flushed to disk. There is a risk of data loss. Not recommended.\nPrimary-Standby Synchronization Relationship Primary-Standby Reliability Failover # When the primary crashes, the standby needs to initiate failover, at which point the standby becomes the new primary. PostgreSQL does not provide a method to detect failures, but it does provide a method to promote the standby to primary. 
(Typically, third-party tools call the PostgreSQL promotion method, while primary-standby monitoring, primary crash detection, connection switching, etc. are not handled by PostgreSQL itself.) PostgreSQL provides 2 methods to activate a standby as the primary: the trigger_file file and the pg_ctl promote command. (In PostgreSQL 12, trigger_file was renamed promote_trigger_file; that parameter was removed in PostgreSQL 16.) Both trigger_file and pg_ctl promote can activate the standby with a single command. The difference is that trigger_file requires the trigger_file setting to be written in recovery.conf in advance. Using trigger_file for primary-standby switchover (pg_ctl promote has the same effect and is simpler):\nConfigure trigger_file in the standby\u0026rsquo;s recovery.conf; shut down the primary; touch trigger_file to start the old standby as the new primary; configure recovery.conf to start the old primary as the new standby; observe the new and old primary/standby databases. Failover Example: Environment: Primary\t172.17.100.150\t5432 Standby\t172.17.100.150\t5433\n1. Configure trigger_file in standby recovery.conf\n$ cat recovery.conf|grep trigger trigger_file = \u0026#39;/pg/pg96data_sla/trigger.kenyon\u0026#39; $ ll /pg/pg96data_sla/trigger.kenyon ls: cannot access /pg/pg96data_sla/trigger.kenyon: No such file or directory Simply configure the trigger file path in recovery.conf. The trigger file won\u0026rsquo;t appear until it\u0026rsquo;s created.\nAdd configuration to standby postgresql.conf\nmax_wal_senders = 6 #max_wal_senders is the maximum number of sender processes, default is 0 in 9.6 (10 since PostgreSQL 10), so the standby must configure this before switchover hot_standby=on #Enable query functionality on standby 2. Shut down the primary\n$ pg_ctl stop -D /pg/pg96data_pri -m fast waiting for server to shut down.... 
done server stopped (Check if primary WAL has been fully applied by the standby. PostgreSQL 9.6 and earlier: cd pg_xlog and use pg_xlogdump; PostgreSQL 10+: cd pg_wal and use pg_waldump)\nls -ltr|tail -n 1 |awk \u0026#39;{print $NF}\u0026#39;|while read xlog;do pg_xlogdump $xlog;done Look for the keyword \u0026ldquo;shutdown\u0026rdquo; in the standby\u0026rsquo;s WAL\n3. touch to activate standby (or pg_ctl promote -D /pg/pg96data_sla)\n$ touch /pg/pg96data_sla/trigger.kenyon At this point recovery.conf becomes recovery.done\n4. Set up the old primary as the new standby Configure the new standby\u0026rsquo;s recovery.conf file. You can directly copy from the old standby and modify the IP and directory.\nvi $new_standby/recovery.conf standby_mode = on primary_conninfo = \u0026#39;host=172.17.100.150 port=5433 user=lzl password=lzl\u0026#39; recovery_target_timeline = \u0026#39;latest\u0026#39; Configure postgresql.conf, write hot_standby = on to enable queries on the standby\nvi $new_standby/postgresql.conf hot_standby = on Start the new standby\n/pg/pg96/bin/pg_ctl -D /pg/pg96data_pri -l /pg/pg96data_pri/server.log start 5. Check primary and standby\npostgres=# \\x Expanded display is on. postgres=# select * from pg_stat_replication ; -[ RECORD 1 ]----+------------------------------ pid | 24766 usesysid | 16384 usename | lzl application_name | walreceiver client_addr | 172.17.100.150 client_hostname | client_port | 47345 backend_start | 2021-07-30 07:44:05.582546+00 backend_xmin | state | streaming sent_location | 0/4033790 write_location | 0/4033790 flush_location | 0/4033790 replay_location | 0/4033790 sync_priority | 0 sync_state | async pg_basebackup # pg_basebackup is PostgreSQL\u0026rsquo;s built-in backup tool for performing base backups. pg_basebackup can be used for PITR and also for constructing log-shipping standby and streaming standby. It is PostgreSQL\u0026rsquo;s physical backup tool. https://liuzhilong.blog.csdn.net/article/details/119533506\npg_rewind # pg_rewind can be used as a maintenance tool for PostgreSQL primary-standby setups. 
When the timelines of two PostgreSQL instances diverge, pg_rewind can synchronize between the instances. (For example, if the standby is running after failover while the primary was still running, the timelines of primary and standby will have diverged.) https://liuzhilong.blog.csdn.net/article/details/119250794\nReplication Slots # What are PostgreSQL Replication Slots? In a primary-standby architecture, if the standby hasn\u0026rsquo;t received WAL logs yet but the primary has already deleted them, such lag cannot be automatically recovered. Replication slots ensure that the primary won\u0026rsquo;t delete WAL logs that haven\u0026rsquo;t been transmitted to the standby yet. Without replication slots, you might need to use wal_keep_size/wal_keep_segments and archive_command to ensure WAL logs aren\u0026rsquo;t deleted, but this approach always retains too many WAL files and cannot guarantee that WAL won\u0026rsquo;t be deleted when lag is significant. This is exactly why replication slots were created. However, replication slots may cause the primary to never delete WAL (e.g., if the standby has crashed), causing disk space to fill up. In this case, max_slot_wal_keep_size is needed to set an upper limit on WAL file retention.\nReplication Slot Parameters: max_slot_wal_keep_size: When replication slots are in use, this parameter defines the maximum size of WAL files in the pg_wal directory. The default value is -1, meaning there is no upper limit on the size of WAL files retained by the primary for the standby. wal_keep_segments/wal_keep_size: PostgreSQL 12 and below use wal_keep_segments, PostgreSQL 13 and above use wal_keep_size. Ensures that WAL files under pg_wal are not deleted. Without replication slots, WAL files exceeding this size may be deleted, potentially causing the standby to be unable to catch up. If set too large, it may cause the directory to grow excessively. The default is 0, meaning WAL files are not retained. 
If WAL is deleted, the following error may occur: ERROR: requested WAL segment xxxx has already been removed At this point the standby can only recover from archived WAL; otherwise, it must be rebuilt. primary_slot_name: Sets the slot name, indicating that the PostgreSQL primary-standby setup uses replication slots. So enabling PostgreSQL replication slots requires at least the following configuration: primary_conninfo = \u0026#39;host=172.17.100.150 port=5433 user=lzl password=lzl\u0026#39; primary_slot_name = \u0026#39;pg_slot_lzl\u0026#39; max_replication_slots: The maximum number of replication slots. Takes effect upon restart. If there aren\u0026rsquo;t enough replication slots, the standby will fail to start. This value should be set relatively high. In PostgreSQL 9.6 and earlier, the default is 0; in PostgreSQL 10 and above, it\u0026rsquo;s 10.\nCreating PostgreSQL Replication Slots\n1. Set max_replication_slots on the primary Primary: (my PostgreSQL version is 9.6) max_replication_slots=10 Add to postgresql.conf and restart the primary\n2. Create replication slot Create replication slot:\npostgres=# SELECT * FROM pg_create_physical_replication_slot(\u0026#39;pg_slot_lzl\u0026#39;); slot_name | xlog_position -------------+--------------- pg_slot_lzl | View replication slot\npostgres=# SELECT slot_name, slot_type, active FROM pg_replication_slots; slot_name | slot_type | active -------------+-----------+-------- pg_slot_lzl | physical | f 3. Set primary_slot_name on the standby primary_slot_name = 'pg_slot_lzl' Add to recovery.conf and restart the standby\n4. 
Check replication slot\npostgres=# select *,pg_xlogfile_name(restart_lsn)as current_xxlog from pg_replication_slots; slot_name | plugin | slot_type | datoid | database | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | current_xxlog -------------+--------+-----------+--------+----------+--------+------------+------+--------------+-------------+---------------------+-------------------------- pg_slot_lzl | | physical | | | t | 12802 | | | 0/A002340 | | 00000002000000000000000A --pg_xlogfile_name(restart_lsn) to view current WAL log info Query Conflicts # What are Query Conflicts? The standby may encounter the following error during queries: ERROR: canceling statement due to conflict with recovery\nWhy do conflicts occur? For example, suppose the standby is executing a query on a certain table (this query could be from an application or a manual connection), and the primary executes a drop table operation. This operation is written to WAL logs and transmitted to the standby to be applied. To stay consistent with the primary, the standby must replay the WAL promptly, at which point the drop table and the select conflict. Conflict scenarios: The above only introduces one type of query conflict. To summarize, there are several situations:\nPrimary exclusive locks (including explicit LOCK commands and various DDL operations); primary vacuum cleaning up dead tuples (if the standby is using those tuples, a conflict will occur); primary drops the tablespace that the standby query is using; primary drops the database that the standby is using. Consider a primary-only scenario: Scenario 1: A session issues a drop table and finds that a select statement is currently executing. The session can only wait for the select to complete its transaction. 
Scenario 2: A session issues a vacuum or automatic background vacuum — it won\u0026rsquo;t conflict with current database queries because vacuum won\u0026rsquo;t clean up tuples that are in use.\nThe standby\u0026rsquo;s handling is different. Because the primary doesn\u0026rsquo;t know the standby\u0026rsquo;s transaction status, and the standby needs to stay consistent with the primary, this is why \u0026ldquo;query conflicts\u0026rdquo; occur.\nQuery Conflict Parameters hot_standby_feedback: This is the most frequently mentioned parameter in the topic of query conflicts. Let\u0026rsquo;s explore it in detail below. Suppose, without a standby, Session 1 queries a row of data, Session 2 deletes that data and commits. Then Session 2 performs a vacuum. We know this vacuum won\u0026rsquo;t delete that row because Session 1\u0026rsquo;s transaction still needs to use that tuple, so it won\u0026rsquo;t be cleaned up. What about in a primary-standby setup? How does the primary know that the standby is still querying when it\u0026rsquo;s about to perform a vacuum? This is the purpose of this parameter. After setting hot_standby_feedback, the standby will periodically notify the primary of the minimum active transaction ID (xmin) value, so the primary vacuum process won\u0026rsquo;t clean up tuples with values greater than xmin. This parameter helps reduce conflicts but cannot completely avoid them. If you think about it carefully, this parameter only reduces conflicts caused by the primary vacuuming dead tuples — it cannot resolve conflicts caused by exclusive locks. Or conflicts caused by network interruptions: if the network between primary and standby is interrupted, the standby cannot send the xmin value to the primary normally. If the interruption is long enough, the primary will still clean up useless tuples during this period, and after the network recovers, the vacuum conflict described above may occur. 
It\u0026rsquo;s worth noting that the hot_standby_feedback parameter won\u0026rsquo;t override the value limited by the old_snapshot_threshold parameter on the primary. The old_snapshot_threshold parameter limits the infinite expansion of dead tuples. When transaction information exceeds the old_snapshot_threshold limit, cleanup will still occur.\nmax_standby_streaming_delay: The waiting time before the standby cancels a query due to a conflict caused by receiving WAL stream logs. Setting this parameter means that when a conflict occurs, the standby query won\u0026rsquo;t be immediately canceled but will wait for a period before throwing an error if it hasn\u0026rsquo;t finished. The value can be set based on the expected runtime of potential long transactions on the standby.\nmax_standby_archive_delay: The waiting time before the standby cancels a query due to a conflict caused by processing archived WAL logs. Similar to the parameter above.\nvacuum_defer_cleanup_age: Specifies the number of transactions by which vacuum delays cleaning up dead tuples. Vacuum will delay clearing invalid records. The number of deferred transactions is set through vacuum_defer_cleanup_age. That is, vacuum and vacuum full operations won\u0026rsquo;t immediately clean up recently deleted tuples.\nYou can view conflict occurrences through the pg_stat_database and pg_stat_database_conflicts views.\nOther Related Parameters # Transmission Parameters max_wal_senders: The maximum number of services that can fetch WAL using wal sender, i.e., the maximum number of standby databases + basebackup clients. PostgreSQL 9.6 defaults to 0; PostgreSQL 10 and later default to 10. wal_send_timeout: Interrupt replication after WAL transmission fails for xx seconds. When the standby crashes or the network is interrupted for a long time, WAL will no longer attempt transmission. Default is 60. 0 means never interrupt replication. track_commit_timestamp: Record transaction timestamps. 
Default is off.\nPrimary Parameters synchronous_standby_names: Configured on the primary. The standby replication list. There are several forms (s1, s2, s3 represent the standby\u0026rsquo;s application_name, configured in recovery.conf): synchronous_standby_names=\u0026#39;s1\u0026#39; means the primary can commit when s1 standby returns. synchronous_standby_names=\u0026#39;FIRST 2 (s1,s2,s3)\u0026#39; means the primary can commit when the first two of the three standbys (s1 and s2) return. synchronous_standby_names=\u0026#39;ANY 2 (s1,s2,s3)\u0026#39; means the primary can commit when any two of the three standbys return. synchronous_standby_names=\u0026#39;*\u0026#39; means matching any standby name; the primary can commit when any standby returns. wal_level: WAL log level. This parameter determines how much information is written to WAL logs. The default is replica, which supports replication and WAL archiving while also supporting standby read-only queries. minimal: Other than records needed for instance crash recovery, nothing else is recorded. For example, WAL logging for CREATE TABLE AS, CREATE INDEX, CLUSTER, and COPY can be skipped. The log information recorded in this mode is insufficient to support WAL archiving and streaming replication. logical: Adds additional information on top of replica to support logical decoding. This mode increases WAL log volume, especially for databases with many UPDATE and DELETE operations. Before PostgreSQL 9.6, there were also archive and hot_standby modes, which map to the current replica mode. synchronous_commit: As discussed earlier, 5 modes, each with pros and cons. archive_mode: archive_mode = on enables archiving. archive_command: Archiving command. PostgreSQL archiving directly calls operating system commands. Can be a simple cp command to the backup side. listen_addresses: Listening addresses. \u0026#39;*\u0026#39; means listen on all IPs. 
Default is local.\nStandby Parameters hot_standby: on enables standby read-only queries. primary_conninfo: The connection string for the standby to connect to the primary. E.g., primary_conninfo = \u0026lsquo;host=172.17.100.150 port=5432 user=lzl password=lzl\u0026rsquo;. trigger_file/promote_trigger_file: The trigger file for activating the standby. Before PostgreSQL 12 it\u0026rsquo;s called trigger_file; PostgreSQL 12 and later use promote_trigger_file. Both trigger_file and pg_ctl promote can activate the standby with a single command, as demonstrated earlier. wal_receiver_create_temp_slot: When there is no slot, temporarily create one (named after primary_slot_name). Default is off.\nReferences: # 《The Way of PostgreSQL》(修炼之道) https://www.postgresql.org/docs/current/warm-standby.html\nhttps://www.postgresql.org/docs/13/high-availability.html\nhttps://www.postgresql.org/docs/current/runtime-config-replication.html\nhttps://www.postgresql.org/docs/13/runtime-config-wal.html\nhttps://www.postgresql.org/docs/current/app-pgbasebackup.html\nhttps://www.postgresql.org/docs/current/hot-standby.html#HOT-STANDBY-CONFLICT\nhttps://cloud.tencent.com/developer/article/1555354\nhttps://www.modb.pro/db/29737\nhttps://wiki.postgresql.org/wiki/Streaming_Replication\nhttps://www.percona.com/blog/2018/09/07/setting-up-streaming-replication-postgresql/\nhttps://www.cybertec-postgresql.com/en/the-synchronous_commit-parameter/\nhttps://blog.csdn.net/m15217321304/article/details/88850146\nhttps://blog.51cto.com/lishiyan/2460518?source=dra\n","date":"Aug 13, 2024","externalUrl":null,"permalink":"/en/2024/08/13/postgresql-streaming-replication/","section":"Posts","summary":"What is PostgreSQL Streaming Replication? # Streaming Replication is a method for transmitting WAL logs introduced in PostgreSQL 9.0. As soon as the primary database generates a log, it is immediately passed to the standby database. 
Before PostgreSQL 9.0, PostgreSQL could only transfer WAL logs one at a time (log shipping), and the standby database lagged behind the primary by at least one WAL log. ","title":"PostgreSQL Streaming Replication","type":"posts"},{"content":" Basic Memory Concepts # Operating system memory is very important and fairly complex. Many knowledge points need to be mastered to further analyze program issues. Since this is the first comprehensive and systematic exposure to OS memory, the goal is to understand Linux memory concepts thoroughly and at a low level without diving deep into principles, so this chapter will also try to avoid Linux source code knowledge.\nPhysical Memory and Virtual Memory # (https://en.wikipedia.org/wiki/Memory_address)\nPhysical Memory: Physical memory is the actual hardware memory present in a computer system, typically in the form of RAM (Random Access Memory).\nVirtual Memory: Virtual memory is a linear region that has not been allocated actual physical memory. Programs think they have a larger address space than the actual physical memory. The implementation of virtual memory allows programs to access a larger address range than physical memory without requiring all data to be present in physical memory simultaneously. The kernel releases physical pages by releasing linear regions, finding the corresponding physical pages, and releasing them all.\nMemory Management Unit (MMU): A hardware component responsible for converting virtual addresses used by programs into physical addresses where data is actually stored in physical memory. The MMU\u0026rsquo;s primary task is to perform address mapping.\nPage Table: A page table is a data structure used to store the mapping between virtual address space and physical address space. 
When a program attempts to access virtual memory, the MMU determines the corresponding physical address by querying the page table.\nSystem call flow: https://users.cs.utah.edu/~aburtsev/cs5460/lectures/lecture19-memory-management/lecture19-memory-management.pdf\n(The image is a bit blurry, the topmost text is \u0026ldquo;User Space|Kernel Space\u0026rdquo;)\nUser programs can only access the kernel system through C libraries or system calls; user programs cannot directly access the kernel system The kernel system accesses physical memory through the MMU; it accesses disks and other external devices through drivers The virtual memory system (VM Subsystem in the figure above) includes buddy, slab algorithms, etc. User Space and Kernel Space # The process virtual address space is divided into user space and kernel space.\nUser Space:\nThe space where user processes run in memory This portion of space is protected, and the system prevents other processes from accessing it (except for shared memory) However, kernel processes can directly access user processes Kernel Space:\nKernel space is the space used by kernel processes In kernel space, the operating system\u0026rsquo;s kernel code runs with higher privilege levels, allowing direct access to system hardware, process management, file system operations, etc. Context Switching:\nWhen a user program needs to access system services or perform operations requiring higher privileges, a context switch from user space to kernel space is triggered. Context switching is an operating system mechanism for saving and restoring program state, ensuring no data loss occurs when switching between user programs and the kernel. The division between user space and kernel space is to provide security isolation, preventing user programs from directly affecting critical parts of the operating system. 
Early operating systems and DOS did not distinguish between kernel and user space, so a single program\u0026rsquo;s error or malicious behavior could affect the entire system.\n(https://www.zhihu.com/tardis/zm/art/66794639?source_id=1003)\n32-bit systems: Total 4GB address space, 3G UserSpace | 1G KernelSpace\n64-bit systems: Total 256TB address space, 128T UserSpace | 128T KernelSpace\n2^32=4GB, 2^64=16777216TB, why does a 64-bit system only have 256TB address space?\nThe 64-bit computing wiki has an explanation. In short, 256TB (256 × 1024^4 bytes) of memory addresses is sufficient, and currently and in the imaginable future there won\u0026rsquo;t be 16EB (16 × 1024^6 bytes) of memory.\nProcess Virtual Address Space # Each process typically has its own independent virtual memory space. Virtual memory is an abstract concept that provides each running process with an address space that appears continuous and private, making each process feel like it has the entire computer system\u0026rsquo;s full memory.\nProcess virtual address space layout:\n(https://www.sohu.com/a/392831824_467784)\nThe mmap mapping region expands from top to bottom, and the mmap mapping region and heap expand relative to each other until the remaining area in the virtual address space is exhausted. This structure facilitates the C runtime library\u0026rsquo;s use of the mmap mapping region and heap for memory allocation. 
Stack: Stores local variables and function parameters during program execution, growing from high addresses to low addresses Heap: Dynamic memory allocation area, managed through functions like malloc, new, free, and delete BSS (Uninitialized Variables): Stores uninitialized global variables and static variables Data: Stores global variables and static variables with predefined values in source code Text: Stores read-only program execution code, i.e., machine instructions Process virtual address space distribution and mapping:\n(https://velog.io/@mysprtlty/%EA%B0%80%EC%83%81-%EB%A9%94%EB%AA%A8%EB%A6%AC%EC%99%80-%EA%B0%80%EC%83%81-%EC%A3%BC%EC%86%8C-%EA%B3%B5%EA%B0%84)\nShared Memory # As mentioned earlier, the user space in the virtual address space cannot be accessed by other user processes. If multi-process user access to the same memory data is implemented through the kernel area, context switching cannot be avoided. Multi-process applications clearly need inter-process access, so a method that directly allows user processes to access the same physical memory emerged — this is shared memory.\nShared memory is one of the mechanisms for implementing IPC (Inter Process Communication), with other methods including message queues and semaphores.\n(https://www.geeksforgeeks.org/inter-process-communication-ipc/)\nSince it is inherently multiple virtual memory address spaces corresponding to one physical memory address space, you just need to point a segment in the address spaces of two processes to the same physical memory.\n(https://www.softprayog.in/programming/interprocess-communication-using-system-v-shared-memory-in-linux)\nShared memory (seems like) has many implementation methods. For example, PostgreSQL defaults to using mmap to implement shared memory, refer to the shared_memory_type parameter and Managing Kernel Resources. 
Other shared memory implementations can be found in this article: Song Baohua: The Best Shared Memory in the World (The Most Thorough Linux Shared Memory Article)\nPage Table # The process virtual address space is per-process, while there is only one physical memory space. So how do you map and convert virtual memory and shared memory?\n(https://courses.engr.illinois.edu/cs241/sp2014/lecture/09-VirtualMemory_II_sol.pdf)\nThe page table is where the correspondence between virtual memory addresses and physical memory addresses is stored. (There are concepts like MMU and TLB here — let\u0026rsquo;s simplify and just think of it as the virtual-to-physical memory conversion function (PAGING), and only look at the page table here). A page table consists of a set of Page Table Entries (PTEs), with each PTE storing the map between a virtual page and a physical page.\nAlthough a single page table can implement memory-to-virtual-memory conversion, implementing it directly this way would consume too much memory for the page table itself.\n(https://courses.engr.illinois.edu/cs241/sp2014/lecture/09-VirtualMemory_II_sol.pdf)\nTherefore, the single page table needs to be subdivided: two-level page tables and four-level page tables.\nTwo-level page tables:\nA two-level page table is a further subdivision of a single page table. 4G of space requires 4M of page tables to store the mapping table. If these 4M are divided into 1K pages (4K each), these 1K pages also need a table for management, which we call the page directory table. 
This page directory table has 1K entries, each 4 bytes, making the page directory table size 4K as well.\nFour-level page tables:\nFor 64-bit systems, two-level page tables are insufficient; four-level page tables are needed.\n(https://maodanp.github.io/2019/06/02/linux-virtual-space/)\nCheck page table size:\n[pg@lzl 2345]$ cat /proc/meminfo |grep PageTables PageTables: 46736 kB NUMA # Uniform Memory Access (UMA): All CPUs have equivalent access time to memory. The problem with UMA is that multiple processors access memory through a single bus, increasing the load on the shared bus. Multiple processors contend for the memory controller causing conflicts. Additionally, the bus bandwidth is limited, leading to access delays.\nNon-Uniform Memory Access (NUMA): A small group of CPUs access their own local memory together. When there are multiple groups of CPUs and their memory groups, each group of CPUs and memory constitutes a NUMA node.\nUMA:\nNUMA:\n(https://users.cs.utah.edu/~aburtsev/cs5460/lectures/lecture19-memory-management/lecture19-memory-management.pdf)\nBasic NUMA characteristics:\nCPU access to local node memory is faster than remote By default, Linux prioritizes allocating local memory on the CPU; the policy can be configured Each node has its own memory structure NUMA is not suitable for all scenarios; it requires adaptation by upper-layer applications NUMA balancing: Achieves local access by automatically moving tasks to CPUs closer to the memory they access, or by migrating remote data to local memory. Enabled by default on Red Hat 7.\nTransferring tasks or copying data itself consumes resources and can slow down tasks. This feature may not be suitable for some applications; for example, Oracle\u0026rsquo;s Exadata has targeted NUMA optimizations.\nnumactl: NUMA OS configuration tool.\nnumactl --hardware displays CPU and node information. 
Below is an example of 4 nodes with 64c 256g total, each node having 16c 64g:\navailable: 4 nodes (0-3) node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39 node 0 size: 65418 MB node 0 free: 310 MB node 1 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47 node 1 size: 65536 MB node 1 free: 41 MB node 2 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55 node 2 size: 65536 MB node 2 free: 82 MB node 3 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63 node 3 size: 65536 MB node 3 free: 43 MB Zone # NUMA divides CPUs and memory into multiple nodes (node 0, node 1, node 2\u0026hellip;). In UMA structures, the CPU memory as a whole can be viewed as node 0.\nIn Linux, each node is represented by the data structure struct pglist_data, with the data type typedef pg_data_t. Each node is further divided into multiple zones. A zone\u0026rsquo;s data structure is zone_t, with the data type zone_struct. There are generally 3 types: ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM, each with different functions.\n(https://www.kernel.org/doc/gorman/html/understand/understand005.html)\nZone distribution and functions in 32-bit:\nZONE_DMA: (\u0026lt;16MB), Direct Memory Access (DMA), the ancient 16 MiB limit, includes ISA devices. ZONE_DMA32: Since many devices encounter problems accessing memory that cannot be addressed with 32 bits, this zone was added in x86-64. This zone only exists in x86-64 architecture. (See ZONE_DMA32) ZONE_NORMAL: (16MB to 896MB), ordinary memory domain that can be directly mapped to the kernel segment; most kernel operations take place in the NORMAL zone, this is the most important zone ZONE_HIGHMEM: (\u0026gt;896MB), marks physical memory beyond the kernel segment, cannot be directly called by the kernel. Zone distribution diagram for 32-bit and 64-bit:\nhttps://users.cs.utah.edu/~aburtsev/cs5460/lectures/lecture19-memory-management/lecture19-memory-management.pdf\nNote that zones are for physical memory. 
Virtual memory must switch from user mode to kernel mode before it can call physical memory. The following diagram shows the relationship between kernel addresses in virtual memory space and zones in physical address space:\n(https://wr.informatik.uni-hamburg.de/_media/teaching/wintersemester_2014_2015/kp-1415-memory-management.pdf)\nInspect zones:\ncat /proc/zoneinfo\ncat /proc/buddyinfo\ncat /proc/pagetypeinfo\n$ cat /proc/buddyinfo Node 0, zone DMA 1 1 1 1 1 0 1 1 1 1 2 Node 0, zone DMA32 688 2080 1420 995 596 357 278 241 276 32 133 Node 0, zone Normal 195748 204074 161167 119070 70791 33578 9556 2070 1034 2533 7328 Node 1, zone Normal 11705 51467 36752 21326 11343 7309 5024 3403 2597 3056 10898 Pages # Virtual memory and physical memory are divided into fixed-size segments, typically 4KB in size. So after virtual memory is divided, we have virtual pages, and after physical memory is divided, we have physical pages (PP or PF, Physical Page or Page Frame), also called page frames, also 4KB. The page frame represents the minimum unit of system memory.\nEach page in the virtual address space can be mapped to a page frame in the physical address space through its descriptor.\nHuge Pages / Transparent Huge Pages # Pages are the minimum unit of memory allocation (default 4K). When mapping and allocating a large number of contiguous pages, performance is poor. Huge Pages solve this problem. Huge pages are not only cheaper to allocate, but the page table is also relatively smaller. hugepagesz is 2 MB or 1 GB, defaulting to 2MB. Huge Pages were implemented starting from Red Hat 6.\nSince manually managing huge pages is cumbersome, Red Hat 6 also provided automatic huge page management, i.e., Transparent Huge Pages.\nIn Oracle database management, huge pages are generally enabled for SGA use, while transparent huge pages are disabled. There is plenty of related material available for searching.\nSimilarly, PostgreSQL can also enable huge pages. 
Since databases generally occupy more operating system memory, enabling huge pages for databases can generally reduce memory allocation pressure.\nFile Pages \u0026amp; Page Cache / Anonymous Pages \u0026amp; Swap Cache # File pages can be mapped to files on disk. File system reads and writes use Page Cache as buffered IO. Dirty data is synced (or fsynced, etc.) to the corresponding disk periodically or when called. Page Cache is the memory area used to \u0026ldquo;boost\u0026rdquo; disk performance.\nCorrespondingly, pages without associated files are called Anonymous Pages, generally corresponding to heap and stack. When memory resources are tight, the kernel writes infrequently used anonymous page data to swap partitions or swap files.\nIn short:\nPage cache corresponds to file mappings Swap cache corresponds to anonymous pages (https://www.slideshare.net/raghusiddarth/memory-management-in-linux-11551521?from_search=2)\nThe above page cache diagram is from the operating system\u0026rsquo;s perspective. Application (such as database) writes can also be non-delayed, or even bypass Page Cache.\nMemory Allocation # Memory allocation is also very complex, involving many concepts. Two common memory allocation methods are buddy and slab.\nBuddy # The buddy system is used for allocating contiguous memory pages. Each zone has its own buddy system. The buddy system divides large blocks of memory to respond to memory allocation requests, and due to its coalescing characteristics, it can reduce system memory fragmentation.\nThe buddy allocator divides memory into blocks of 2^order contiguous pages, with the maximum order being 10:\nWhen no free block of the requested size is available, the system splits a larger block into two equally sized buddy blocks. When memory is freed, the system attempts to merge adjacent buddy blocks into a larger block:\nWhen freeing a page, the page is directly placed back into the free list. 
If the other half of the previously split page is also unallocated, they are combined into a double-sized page and given to the next larger list, and so on, until it can no longer be merged or has reached the top.\nWhen higher-order pages are depleted due to continuous allocation, fragmentation issues arise when requesting higher-order pages: After waiting for memory reclamation to succeed, buddy itself merges lower orders into higher orders, then allocates higher-order pages:\n(The implementations of anti pages fragmentation in Linux kernel https://teawater.github.io/presentation/antif.pdf)\nHowever, memory reclamation may also not keep up with allocation speed, so the buddy system is not always ideal.\nAnalysis example:\n$ cat /proc/buddyinfo Node 0, zone DMA 0 0 0 1 2 1 1 0 1 1 3 Node 0, zone DMA32 7 6 5 6 5 6 7 7 6 2 272 Node 0, zone Normal 317681 38869 31620 19250 8931 2579 815 182 19 5 0 The above contains 3 ZONEs: DMA, DMA32, Normal Orders: 0 ~ 10, i.e., the count of each order in buddy. The maximum order of buddy is 10, i.e., 1024 pages, which is 4MB For example, the 3rd column in the Normal row indicates there are 31620 blocks of 2^2 contiguous memory available By extension, the further back, the more contiguous the space. The larger the number, the more contiguous space of that size there is. When large contiguous spaces are scarce, it indicates significant memory fragmentation Additionally, summing everything up gives the current free memory Judging memory fragmentation issues through buddyinfo:\n#host 1 Node 0, zone Normal 317681 38869 31620 19250 8931 2579 815 182 19 5 0 #host 2 Node 0, zone Normal 7321 7833 10885 8514 2311 1644 1663 1302 1141 7384 80675 The above shows the memory conditions of two hosts. Comparing them, the host below has more contiguous memory, while the host above has memory fragmentation issues.\nSlab # The slab allocator manages memory based on objects. 
The slab system is a memory allocation algorithm specifically designed for kernel memory. It works by dividing memory into fixed-size caches, where each slab contains a set of objects of the same type. When there is a memory request, the algorithm first checks if available objects exist in the appropriate slab cache. If they exist, the object is returned. If not, the algorithm allocates a new slab and adds it to the appropriate cache.\nObjects of different sizes correspond to different slab caches:\n(https://bootlin.com/doc/training/linux-kernel/linux-kernel-slides.pdf)\nAlthough slab has different caches and objects, slab still uses physically contiguous memory:\n(https://i.stack.imgur.com/wo8Gg.png)\nSlab also has 3 implementation methods:\nMemory Reclamation # Recommended article: Linux Forced Memory Reclamation, Linux Memory Source Code Analysis - Memory Reclamation\nMemory Reclamation Overview # When system memory pressure is high, memory reclamation is performed on each zone under pressure. Memory reclamation mainly targets anonymous pages and file pages. For anonymous pages, during memory reclamation, some infrequently used anonymous pages are selected, written to the swap partition, and then released as free page frames to the buddy system. For file pages, during memory reclamation, some infrequently used file pages are also selected: If the content saved in this file page is consistent with the corresponding file content on disk, this file page is a clean file page and does not need to be written back; it is directly released as a free page frame to the buddy system. If the data saved in the file page is inconsistent with the corresponding data in the file on disk, this file page is considered a dirty page. It must first be written back to the corresponding data location on disk, and then released as a free page frame to the buddy system. 
After memory reclamation completes, the number of free page frames in the system increases, alleviating memory pressure. However, the reclamation process puts significant IO pressure on the system. Therefore, a threshold is set for each zone in the system. When the number of free page frames falls below this threshold, memory reclamation operations are performed. When the number of free page frames meets this threshold, the system does not perform memory reclamation operations. Zone Watermarks and kswapd # (https://vivani.net/2022/06/14/linux-kernel-tuning-page-allocation-failure/)\nWhen available memory is low, the kswapd daemon is awakened to free pages.\npages_low: When the number of available free pages falls below pages_low, the buddy allocator wakes up the kswapd process, and the kernel begins reclaiming pages in the background (writing back dirty file pages and swapping out anonymous pages). pages_min: When the number of available pages reaches pages_min, the pressure of page reclamation work is relatively high because the memory zone urgently needs free pages. The allocator will execute the reclamation work itself in a synchronous manner, sometimes called direct reclaim. pages_high: Once kswapd is awakened and begins freeing pages, the kernel considers the zone \u0026ldquo;balanced\u0026rdquo; only when the number of available pages reaches pages_high. Once free pages reach pages_high, kswapd will re-enter the sleep state. If free pages exceed pages_high, the kernel considers the zone state ideal. Memory reclamation is performed on a per-zone basis. /proc/zoneinfo can display the values of min, low, and high.\nvm.min_free_kbytes sets the pages_min watermark (the low and high watermarks are derived from it), which makes it a very important OS parameter. Very low values prevent the system from effectively reclaiming memory, potentially leading to system crashes and service interruptions. 
Too high values increase system reclamation activity, causing allocation delays, which may lead the system to immediately enter an out-of-memory state.\nTypes of Memory Allocation and Reclamation # Fast Memory Allocation: Performed by the get_page_from_freelist() function, which obtains a suitable zone from the zonelist using the low threshold for allocation. If the zone has not reached the low threshold, fast memory reclamation is performed, and allocation is retried after fast memory reclamation.\nSlow Memory Allocation: When fast allocation fails, meaning no zone in the zonelist obtained memory in fast allocation, the min threshold is used for slow allocation. During slow allocation, three main things happen: asynchronous memory compaction, direct memory reclamation, and light synchronous memory compaction. Finally, OOM allocation may occur depending on the situation. And after each of these operations, fast memory allocation is called once to attempt to obtain page frames.\n(https://blog.csdn.net/weixin_35094083/article/details/116688112)\nDifferent memory allocation paths trigger different memory reclamation methods. Zone memory reclamation is divided into two types:\nBackground Memory Reclamation (kswapd): When physical memory is tight, the kswapd kernel thread is awakened to reclaim memory. This memory reclamation process is asynchronous and does not block process execution. Direct Memory Reclamation (direct reclaim): If background asynchronous reclamation cannot keep up with process memory application speed, direct reclamation begins. This memory reclamation process is synchronous and blocks process execution. Memory Compaction # Memory compaction: see Memory Monitoring - /proc/pagetypeinfo section\nLRU # For zone memory reclamation, it targets three things for reclamation: slab, pages in LRU lists, and buffer_head. 
Here we only discuss memory reclamation targeting LRU lists.\nThe main purpose of LRU lists is to sort pages, placing pages most deserving of reclamation at the back and pages least deserving of reclamation at the front. Then, during memory reclamation, scanning proceeds from back to front, attempting to reclaim scanned pages.\nLRU list descriptor, containing 5 LRU lists: active/inactive anonymous page LRU lists, active/inactive file page LRU lists, and unevictable page list:\n(https://lpc.events/event/11/contributions/896/attachments/793/1493/slides-r2.pdf)\nFor memory reclamation, it only processes the first 4 LRU lists: active anonymous page LRU list, inactive anonymous page LRU list, active file page LRU list, and inactive file page LRU list. After reclaiming enough page frames, it returns directly: fast memory reclamation and kswapd memory reclamation do this.\nGlobal lruvec can be viewed through meminfo (understood as LRU areas):\n## cat /proc/meminfo |grep -i active Active: 597380 kB Inactive: 601920 kB Active(anon): 10896 kB Inactive(anon): 117376 kB Active(file): 586484 kB Inactive(file): 484544 kB In reality, there is more than one lruvec. cgroup and NUMA nodes each have their own lruvec, and global also has its own lruvec.\nDrop Cache # Writing to /proc/sys/vm/drop_caches asks the kernel to free clean page cache and reclaimable slab objects on demand. The dropped data is read from disk and cached again on the next access.\nDefault value: vm.drop_caches = 0. By default, the Linux kernel does not automatically clear caches through this interface.\nSetting /proc/sys/vm/drop_caches to 1: The kernel frees clean page cache.\nSetting /proc/sys/vm/drop_caches to 2: The kernel frees reclaimable slab objects, which include dentry and inode. 
Dentry and inode are file system metadata structures used to store file and directory information.\nSetting /proc/sys/vm/drop_caches to 3: Equivalent to 1+2, releases all unused caches.\nDropping caches is non-destructive: dirty objects are not freed, so only cached data that is already consistent with disk is released. To release more, run sync first so that dirty pages are written back, which can itself cause an IO spike. After the drop, subsequent reads miss the cache and go to disk, so it is recommended to avoid Drop Cache operations during important I/O-heavy work as this may affect system performance.\nOperation commands:\necho 3 \u0026gt; /proc/sys/vm/drop_caches # Drop clean page cache, dentries and inodes (one-time action) echo 0 \u0026gt; /proc/sys/vm/drop_caches # Reset the stored value (accepted on older kernels; newer kernels only accept 1 to 4, and no reset is needed) Memory Monitoring # Without understanding basic memory knowledge, it is actually very difficult to interpret memory monitoring information. With the above memory fundamentals in place, let\u0026rsquo;s go through memory-related monitoring commands and tools one by one.\nWhat\u0026rsquo;s in the /proc Directory? # /proc mainly contains process information and system information.\nIn the system information part, some are interfaces provided by Linux for system status, allowing you to view monitoring information at the entire operating system level, such as slabinfo, swaps, zoneinfo, buddyinfo.\nThe other part, process, contains running data and status information for each process. cd into the corresponding process directory to see the FDs held by the corresponding process and process memory information.\nProcesses also have threads. Thread information directory: /proc/[pid]/task/[tid]/, with content similar to the process directory.\nFor more proc information, refer to the proc(5) Linux manual page\n/proc/meminfo # /proc/meminfo is the primary interface for understanding the current Linux system memory usage. The most commonly used commands like free, vmstat, ps obtain data through it. /proc/meminfo information is more comprehensive. 
Below we only list some common information. For detailed meanings, refer to the Red Hat documentation\n# General memory information cat /proc/meminfo | grep \u0026#34;Mem\u0026#34; MemTotal: 994328 kB # Total memory size (minus some reserved and kernel) MemFree: 66428 kB # Completely unused physical memory MemAvailable: 207192 kB # Maximum available memory for starting a new application without using swap space # IO buffers cat /proc/meminfo | grep -e \u0026#34;Buffers\u0026#34; -we \u0026#34;Cached\u0026#34; Buffers: 12820 kB # IO buffers used by raw disk blocks, not exceeding 20MB Cached: 254592 kB # Page cache size used by disks (includes tmpfs and shmem, excludes SwapCached) # swap cat /proc/meminfo | grep \u0026#34;Swap\u0026#34; SwapCached: 13936 kB # Swap cache contains anonymous memory pages determined to be swapped but not yet written to physical swap area SwapTotal: 945416 kB # Total swap space size SwapFree: 851064 kB # Remaining swap size # lru active and inactive page counts (self-explanatory) cat /proc/meminfo | grep -e \u0026#34;Active\u0026#34; -e \u0026#34;Inactive\u0026#34; Active: 194308 kB Inactive: 553172 kB Active(anon): 59024 kB Inactive(anon): 437264 kB Active(file): 135284 kB Inactive(file): 115908 kB # Dirty pages cat /proc/meminfo | grep -e \u0026#34;Dirty\u0026#34; -e \u0026#34;Writeback\u0026#34; Dirty: 0 kB # Dirty pages not yet written Writeback: 0 kB # Dirty pages being written WritebackTmp: 0 kB # Temporary buffer for writebacks used by the FUSE module # Map information cat /proc/meminfo | grep -e \u0026#34;AnonPages\u0026#34; -e \u0026#34;Map\u0026#34; AnonPages: 95296 kB # Mapped anonymous pages Mapped: 153192 kB # Mapped file pages DirectMap4k: 113336 kB # Mapped 4k kernel pages DirectMap2M: 1900544 kB # Mapped 2M kernel pages DirectMap1G: 0 kB # Mapped 1G kernel pages # Shared memory cat /proc/meminfo | grep \u0026#34;Shmem\u0026#34; Shmem: 28920 kB # Total memory size of shmem and tmpfs ShmemHugePages: 0 kB # Total huge page 
memory size of shmem and tmpfs ShmemPmdMapped: 0 kB # Shared memory mapped into userspace with huge pages # Kernel memory (note: slab is kernel) cat /proc/meminfo | grep -ie \u0026#34;reclaim\u0026#34; -e \u0026#34;slab\u0026#34; -e \u0026#34;kernel\u0026#34; KReclaimable: 35008 kB # Reclaimable memory allocated to kernel Slab: 88752 kB # Slab cache SReclaimable: 35008 kB # Reclaimable memory in slab cache SUnreclaim: 53744 kB # Non-reclaimable memory in slab cache KernelStack: 5988 kB # Kernel stack memory used by all tasks # Allocatable memory (different meaning from MemAvailable) ## CommitLimit=[(\u0026#34;total RAM pages\u0026#34; - \u0026#34;total huge TLB pages\u0026#34;) * overcommit_ratio]/100 + \u0026#34;total swap pages\u0026#34; ## In short, overcommit_ratio percent of RAM plus swap; here 994328 * 50% + 945416 = 1442580 kB with the default overcommit_ratio of 50. The limit is only enforced when vm.overcommit_memory=2 cat /proc/meminfo | grep -ie \u0026#34;commit\u0026#34; CommitLimit: 1442580 kB # Allocatable memory Committed_AS: 3035924 kB # Estimated memory needed in current worst-case scenario # Virtual memory cat /proc/meminfo | grep -e \u0026#34;Vmalloc\u0026#34; VmallocTotal: 34359738367 kB # Total size of the vmalloc virtual address area VmallocUsed: 34780 kB # Amount of the vmalloc area in use VmallocChunk: 0 kB # Largest contiguous free block in the vmalloc area # Page table memory (self-explanatory) cat /proc/meminfo | grep PageTables PageTables: 4120 kB # Huge page memory cat /proc/meminfo | grep -i hugepage AnonHugePages: 32768 kB ShmemHugePages: 0 kB FileHugePages: 0 kB HugePages_Total: 0 HugePages_Free: 0 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 2048 kB /proc/buddyinfo # Due to its concise and easy-to-understand information, buddyinfo is the most commonly used method for judging memory fragmentation issues. 
See \u0026ldquo;Memory Allocation - Buddy section\u0026rdquo; for details.\n$ cat /proc/buddyinfo Node 0, zone DMA 0 0 0 1 2 1 1 0 1 1 3 Node 0, zone DMA32 7 6 5 6 5 6 7 7 6 2 272 Node 0, zone Normal 317681 38869 31620 19250 8931 2579 815 182 19 5 0 /proc/pagetypeinfo # pagetypeinfo first provides information about page block sizes. It provides the same type of information as buddyinfo but broken down by type and detailing the number of pages of each type.\nBefore understanding pagetypeinfo, you need to first understand memory compaction.\nSuppose the memory in a zone looks like this:\nWhite represents free memory, red represents used memory. The memory fragmentation above is already quite severe. If a request for memory of order 2 or higher is made at this point, it cannot be allocated. This is where memory compaction comes into play. The compaction algorithm marks movable pages and free pages lists on the existing zone.\nThe movable scanner scans from bottom to top, and the free scanner scans from top to bottom. The movable and free scanners will eventually meet at some point in the middle. Then, through page migration, used pages are moved to the top of the zone.\nTwo trigger methods for page compaction:\nWhen allocating pages, if allocation fails at the LOW watermark, slow memory allocation is attempted, during which page compaction occurs Page compaction can be started with echo x \u0026gt; /proc/sys/vm/compact_memory. After starting, the kernel thread kcompactd begins page defragmentation. Because page data is migrated to new locations, there are no performance issues as severe as those caused by memory reclamation. Moreover, since the goal is clearer, the cost of obtaining contiguous pages is lower. Additionally, ANON page reclamation requires SWAP, while this does not.\nNow let\u0026rsquo;s look at /proc/pagetypeinfo:\n$ cat /proc/pagetypeinfo Page block order: 9 Pages per block: 512 ... 
(DMA omitted) Node 0, zone Normal, type Unmovable 870 530 391 157 103 41 9 2 1 0 0 Node 0, zone Normal, type Movable 5886 9235 5728 4072 1561 324 115 41 12 4 13018 Node 0, zone Normal, type Reclaimable 3 4 8 11 2 3 1 1 1 0 0 Node 0, zone Normal, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone Normal, type CMA 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone Normal, type Isolate 0 0 0 0 0 0 0 0 0 0 0 Different pages are classified as pageblocks. Each pageblock is divided into several lists based on its type. When allocating memory, pages are requested from the corresponding list based on the requested page type, and when freed, they return to the corresponding list based on their pageblock. Different pageblocks:\nUnmovable: Pages that cannot be compacted Movable: Pages that can be compacted Reclaimable: Pages that can be reclaimed HighAtomic: Pageblock added to mitigate fragmentation issues. Only higher-order and same-level requests can request pages from this pageblock CMA: CMA stands for Contiguous Memory Allocator Isolate: Pages will not be allocated; used to help isolate pages. When isolating pages, pageblocks are first set to isolate to prevent them from being freed CMA appears to be another large topic, which can be simply understood as a supplement to the buddy system:\n(Memory Journey — How to Improve CMA Utilization? https://ost.51cto.com/posts/10815)\nsmaps \u0026amp; maps \u0026amp; pmap # VSS/RSS/PSS/USS\nWhen viewing the memory occupied by a process, there are commonly four forms: VSS/RSS/PSS/USS, mainly differing in memory calculation methodology.\n(https://cloud.tencent.com/developer/article/1683708)\nVSS (Virtual Set Size) is just a virtual space size, with little significance for actual memory usage. RSS (Resident Set Size) is used for calculating the total memory occupied by a process, including shared memory size occupied by shared libraries. For example, if private memory size is N and shared memory size is M, then RSS = N + M. 
This can be misleading, because for large shared libraries like libc, shared by many processes, counting it all against one process is not scientific. PSS (Proportional Set Size) is the actual physical memory occupied by a single process when running, including proportionally allocated shared library memory. If a shared library is used by N processes, the size proportionally allocated to PSS is 1/N. PSS calculates process memory more accurately, including exclusive memory plus the shared portion. USS (Unique Set Size) is the physical memory exclusively occupied by a process, not including shared memory. /proc/[pid]/maps\n/proc/[pid]/maps can view the user space memory mappings of the process\u0026rsquo;s virtual memory.\n[pg@lzl 2345]$ cat maps StartAddr-EndAddr Perms Offset Dev Inode Filename 00400000-00bae000 r-xp 00000000 fd:00 1093852 /pg/pg15.3/bin/postgres ---text segment 00dad000-00dc3000 rw-p 007ad000 fd:00 1093852 /pg/pg15.3/bin/postgres 00dc3000-00df5000 rw-p 00000000 00:00 0 00f1e000-00f60000 rw-p 00000000 00:00 0 [heap] ---heap area 33a6000000-33a6022000 r-xp 00000000 fd:00 1976006 /lib64/ld-2.17.so ... 
7fbe2ae09000-7fbe2ae0a000 rw-p 0000c000 fd:00 1975966 /lib64/libnss_files-2.17.so 7fbe2ae1b000-7fbe33ca7000 rw-s 00000000 00:04 12556 /dev/zero (deleted) 7fbe33ca7000-7fbe39b38000 r--p 00000000 fd:00 1181300 /usr/lib/locale/locale-archive 7fbe39b38000-7fbe39b3d000 rw-p 00000000 00:00 0 7fbe39b46000-7fbe39b4d000 rw-s 00000000 00:10 12559 /dev/shm/PostgreSQL.3661351388 7fbe39b4d000-7fbe39b4e000 rw-s 00000000 00:04 32769 /SYSV0010c0b6 (deleted) 7fbe39b4e000-7fbe39b4f000 rw-p 00000000 00:00 0 7fffe3933000-7fffe3948000 rw-p 00000000 00:00 0 [stack] --stack area 7fffe397d000-7fffe397e000 r-xp 00000000 00:00 0 [vdso] ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall] (1) Start-End Address: The address range of this segment in virtual memory (2) Permissions: Permissions of this segment; r-read, w-write, x-execute, p-private (3) Offset: The offset of this segment mapping in the file (4) Device: The device number of the device where the mapped file resides, corresponding to vm_file-\u0026gt;f_dentry-\u0026gt;d_inode-\u0026gt;i_sb-\u0026gt;s_dev. Anonymous mappings have 0. fd is the major device number, 00 is the minor device number. (5) Inode: Corresponds to vm_file-\u0026gt;f_dentry-\u0026gt;d_inode-\u0026gt;i_ino, matches the content displayed by ls -i, anonymous mappings have 0. (6) Mapped File Name: For named mappings, it\u0026rsquo;s the mapped file name. For anonymous mappings, it\u0026rsquo;s the role of this memory segment in the process.\nBelow is an analysis by Wenxin (it actually analyzed it correctly, this is a PostgreSQL postmaster process):\n/proc/[pid]/smaps\nThe /proc/[pid]/smaps file is an extension based on /proc/[pid]/maps, providing more detailed information than the maps file in the same directory. 
Each VMA has the following series of data:\n[pg@lzl 2345]$ cat smaps 00400000-00bae000 r-xp 00000000 fd:00 1093852 /pg/pg15.3/bin/postgres Size: 7864 kB --VSS memory Rss: 408 kB --RSS memory Pss: 140 kB --PSS memory Shared_Clean: 404 kB --Shared, clean memory size Shared_Dirty: 0 kB --Shared, dirty (i.e., modified) memory size Private_Clean: 4 kB --Private, clean memory size Private_Dirty: 0 kB --Private, dirty memory size Referenced: 408 kB --Memory currently marked as referenced or accessed Anonymous: 0 kB --Anonymous pages AnonHugePages: 0 kB --Anonymous huge pages Swap: 0 kB --Swapped-out memory size KernelPageSize: 4 kB --Kernel page size MMUPageSize: 4 kB --Page table page size ... 7fffe3933000-7fffe3948000 rw-p 00000000 00:00 0 [stack] Size: 88 kB Rss: 16 kB Pss: 16 kB ... Now we know that maps are the process\u0026rsquo;s memory mapping information, and smaps also includes the memory size of each mapping segment (VSS, RSS, PSS).\nYou can calculate a process\u0026rsquo;s memory usage by summing the PSS, RSS, etc. values in the process smaps. Note the unit is KB. The patterns below are anchored (^Pss:, ^Rss:) so that other fields printed by newer kernels, such as SwapPss, are not counted as well.\nTotal physical memory usage of all processes:\ngrep \u0026#39;^Pss:\u0026#39; /proc/[1-9]*/smaps | awk \u0026#39;{total+=$2}; END {printf \u0026#34;%d kB\\n\u0026#34;, total }\u0026#39; PSS memory of a specific process:\ncat /proc/90875/smaps |grep \u0026#39;^Pss:\u0026#39; |awk \u0026#39;{sum+=$2 };END {print sum/1024}\u0026#39; RSS memory of a specific process:\ncat /proc/68729/smaps |grep \u0026#39;^Rss:\u0026#39; |awk \u0026#39;{sum+=$2 };END {print sum/1024}\u0026#39; Private memory of a specific process:\ncat /proc/90875/smaps|sed \u0026#39;/zero/,/VmFlags/d\u0026#39; |grep Private |awk \u0026#39;{sum+=$2 };END {print sum/1024}\u0026#39; pmap\nThe pmap command parses the /proc/[pid]/maps and /proc/[pid]/smaps files. 
It has few parameters; -x means show more information.\n[root@lzl ~]# pmap -x 2345 2345: /pg/pg15.3/bin/postgres -D /pg/1503data Address Kbytes RSS Dirty Mode Mapping 0000000000400000 7864 212 0 r-x-- postgres 0000000000dad000 88 12 12 rw--- postgres 0000000000dc3000 200 36 32 rw--- [ anon ] 0000000000f1e000 264 12 8 rw--- [ anon ] 00000033a6000000 136 108 0 r-x-- ld-2.17.so ... 00007fbe2ae09000 4 0 0 rw--- libnss_files-2.17.so 00007fbe2ae1b000 145968 4396 4396 rw-s- zero (deleted) 00007fbe33ca7000 96836 8 0 r---- locale-archive 00007fbe39b38000 20 16 16 rw--- [ anon ] 00007fbe39b46000 28 4 4 rw-s- PostgreSQL.3661351388 00007fbe39b4d000 4 0 0 rw-s- [ shmid=0x8001 ] 00007fbe39b4e000 4 4 4 rw--- [ anon ] 00007fffe3933000 84 16 16 rw--- [ stack ] 00007fffe397d000 4 4 0 r-x-- [ anon ] ffffffffff600000 4 0 0 r-x-- [ anon ] ---------------- ------ ------ ------ total kB 268896 5532 4540 The pmap output format is similar to /proc/[pid]/maps, with one line per VMA address, but includes VSS and RSS in addition to maps, allowing you to directly see the size used by each region of the process\u0026rsquo;s virtual memory, helping to quickly determine where the regions with more memory are.\nIf the [heap] in the address space is too large, it might be a heap memory leak. For another example, if the process address space contains too many VMAs (each line in maps can be understood as a VMA), it\u0026rsquo;s likely that the application called many mmaps without munmap. Or, continuously observing changes in the address space — if certain entries are continuously growing, there\u0026rsquo;s likely an issue there.\nAnalysis Example\nFrom the host\u0026rsquo;s TOP memory view, a certain PostgreSQL backend process memory appears relatively high. 
Further analysis of map information is needed:\nPID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 68729 postgres 20 0 5579004 5.116g 5.114g R 97.4 1.4 128:27.94 postgres: lzl: lzldb lzl 30.78.14.174(58067) DELETE Check this process\u0026rsquo;s Rss, Pss, Uss:\ncat /proc/68729/smaps |grep Rss |awk \u0026#39;{sum+=$2 };END {print sum/1024}\u0026#39; 5422.67 ---about 5423 MB Rss cat /proc/68729/smaps |grep Pss |awk \u0026#39;{sum+=$2 };END {print sum/1024}\u0026#39; 467.957 ---about 468 MB Pss cat /proc/68729/smaps|sed \u0026#39;/zero/,/VmFlags/d\u0026#39; |grep Private |awk \u0026#39;{sum+=$2 };END {print sum/1024}\u0026#39; 179.605 ---about 180 MB Uss Rss - Uss is about 5243 MB (roughly 5.1 GB) of shared memory. Pss - Uss is about 288 MB, which is the part of that shared memory attributed to this backend, so this backend accounts for only a small fraction of the shared memory it maps.\n$ pmap -x 68729 68729: postgres: pdmp: pdmpdata pdmp 30.78.14.174(46252) DELETE Address Kbytes RSS Dirty Mode Mapping 0000000000400000 6084 2444 0 r-x-- postgres 0000000000bf0000 4 4 4 r---- postgres 0000000000bf1000 52 52 52 rw--- postgres ... 00002b7f65bfa000 5441216 5365444 5365444 rw-s- zero (deleted) --this part takes the most 00002b80b1daa000 48 0 0 r-x-- libnss_files-2.17.so 00002b80b1db6000 2044 0 0 ----- libnss_files-2.17.so 00002b80b1fb5000 4 4 4 r---- libnss_files-2.17.so 00002b80b1fb6000 4 4 4 rw--- libnss_files-2.17.so 00002b80b1fb7000 24 0 0 rw--- [ anon ] 00002b80ba001000 516 516 516 rw--- [ anon ] 00007fffe16f7000 132 88 88 rw--- [ stack ] 00007fffe175b000 8 4 0 r-x-- [ anon ] ffffffffff600000 4 0 0 r-x-- [ anon ] Diving deeper into the smaps output, we can directly locate the zero (deleted) part:\n$ cat smaps 00400000-009f1000 r-xp 00000000 fd:06 58726481 /paic/postgres/base/9.6.6/bin/postgres ... 
2b7f65bfa000-2b80b1daa000 rw-s 00000000 00:04 72254 /dev/zero (deleted) Size: 5441216 kB Rss: 5365444 kB Pss: 264618 kB Shared_Clean: 0 kB Shared_Dirty: 5365444 kB --shared dirty data Private_Clean: 0 kB Private_Dirty: 0 kB Referenced: 5364764 kB Anonymous: 0 kB AnonHugePages: 0 kB Swap: 0 kB KernelPageSize: 4 kB MMUPageSize: 4 kB Locked: 0 kB VmFlags: rd wr sh mr mw me ms sd From the above analysis, we can conclude: this is a PostgreSQL backend process that has touched and modified a large amount of data in shared memory. Its own private memory is not much; most of its resident memory is shared memory. This is likely a transaction in PostgreSQL that has modified a lot of data but hasn\u0026rsquo;t committed yet.\nAdditionally, /dev/zero (deleted) is explained in the /proc/[pid]/map_files section of the proc(5) Linux manual page:\nAlthough these entries are present for memory regions that were mapped with the MAP_FILE flag, the way anonymous shared memory (regions created with the MAP_ANON | MAP_SHARED flags) is implemented in Linux means that such regions also appear on this directory. Here is an example where the target file is the deleted /dev/zero one:\nlrw-------. 1 root root 64 Apr 16 21:33 7fc075d2f000-7fc075e6f000 -\u0026gt; /dev/zero (deleted) In other words: anonymous shared memory (MAP_ANON | MAP_SHARED) shows up as /dev/zero (deleted).\n/proc/[pid]/status # status can view process state information, including some memory information.\n[root@lzl 2345]# cat status Name: postgres ---the command running this thread State: S (sleeping) ---process state Tgid: 2345 ---Thread group ID (i.e., Process ID) Pid: 2345 ---Thread ID PPid: 1 ---PID of parent process. ... 
VmPeak: 268964 kB ---virtual memory peak VmSize: 268896 kB ---virtual memory current VmLck: 0 kB VmHWM: 13400 kB ---RSS peak VmRSS: 5532 kB ---RSS current VmData: 528 kB ---data segment VmStk: 88 kB ---stack segment VmExe: 7864 kB ---text segment VmLib: 3100 kB ---shared library code segment VmPTE: 136 kB ---Page table entries VmSwap: 308 kB ---swap size Threads: 1 ---number of threads in this process .... Compared to maps, status has no mapping information. The memory data is more summarized, allowing for a more intuitive view of the size occupied by each segment of virtual memory.\nView processes with the most SWAP usage:\nfor file in /proc/*/status ; do awk \u0026#39;/VmSwap|Name|^Pid/{printf $2 \u0026#34; \u0026#34; $3}END{ print \u0026#34;\u0026#34;}\u0026#39; $file; done | sort -k 3 -n -r | head cgroup memory # cgroup memory control is now very common. Some host parameters need to be set in cgroup. Memory settings and monitoring information are under /sys/fs/cgroup/memory/.\ncginfo to view CGROUP memory allocation and usage: /opt/cgtools/cginfo -t perf -s mem\ncginfo -t perf -s mem ==================== Cgroup Performance: memory ==================== DB_TYPE INSTANCE_NAME MEM_OOM MEM_FILE_GB MEM_MAP_GB MEM_USED_GB MEM_ALLO_GB ALLO_RATE MEM_GLOB_GB GLOB_RATE ------- ------------- ------- ----------- ---------- ----------- ----------- --------- ----------- --------- postgres LZLDB 0 154.3 0.0 4.2 160.0 2.6% 375 1.1% View relatively detailed CGROUP memory usage status: /sys/fs/cgroup/memory/[group]/memory.stat\n$ cat memory.stat ... total_cache 167791534080 total_rss 4006932480 total_rss_huge 0 total_mapped_file 11747328 total_swap 0 total_pgpgin 792754417976 total_pgpgout 792712474991 total_pgfault 477971874868 total_pgmajfault 97318 total_inactive_anon 1610874880 total_active_anon 2408255488 total_inactive_file 73446166528 total_active_file 94332768256 total_unevictable 0 smem # smem is a powerful tool for displaying memory usage. 
It reads information from smaps, meminfo, etc. under /proc and outputs summaries. smem can output overall and specific map memory conditions, which is very intuitive and can be analyzed from different dimensions. Overall, it\u0026rsquo;s a very useful tool for analyzing memory usage.\nThe repo can be downloaded directly. Basically, just extract and use it. For more usage, refer to smem memory reporting tool. Below are just simple examples:\nView system memory usage -w:\n[root@lzl ~]# smem -w -k Area Used Cache Noncache firmware/hardware 0 0 0 kernel image 0 0 0 kernel dynamic memory 183.9M 84.0M 99.9M userspace memory 112.3M 62.2M 50.1M free memory 700.3M 700.3M 0 View memory consumption per user -u:\n[root@lzl ~]# smem -s pss -urk User Count Swap USS PSS RSS oracle 25 85.2M 30.8M 95.7M 383.0M root 93 112.4M 38.5M 42.3M 86.2M pg 12 5.9M 1.6M 2.5M 5.9M mysql 1 169.7M 1.7M 1.7M 2.0M View memory consumption for a specific user -U:\n[root@lzl ~]# smem -U pg -k PID User Command Swap USS PSS RSS 2345 pg /pg/pg15.3/bin/postgres -D 364.0K 124.0K 134.0K 228.0K 2352 pg postgres: logical replicati 636.0K 144.0K 161.0K 196.0K ... Filter a specific process -P (PROCESSFILTER, not pid):\n[root@lzl ~]# smem -P postgres -p PID User Command Swap USS PSS RSS 2346 pg /pg/pg16.0/bin/postgres -D 0.01% 0.01% 0.01% 0.01% 2350 pg postgres: walwriter 0.01% 0.01% 0.01% 0.01% ... View process mapping and memory usage -m:\n[root@lzl ~]# smem -P postgres -mpr -s pss Map PIDs AVGPSS PSS \u0026lt;anonymous\u0026gt; 13 0.02% 0.24% [heap] 3 0.07% 0.20% /usr/lib64/libpython2.6.so.1.0 1 0.11% 0.11% /pg/pg15.3/bin/postgres 6 0.01% 0.06% /pg/pg16.0/bin/postgres 6 0.01% 0.06% /dev/zero 12 0.00% 0.03% [stack] 13 0.00% 0.02% ... smem is very intuitive for viewing process USS\\PSS\\RSS. However, there is one issue: smem cannot filter by pid, only by username or PROCESSFILTER. 
When a host has multiple database instances deployed, filtering by parent PID or child PID is not very friendly.\ntop # top can display system running status in real time. top can be quite fancy in its usage. Running top directly can also display a lot of information.\nSorting in top:\ncommand sorted-field supported M %MEM Yes N PID Yes P %CPU Yes T TIME+ Yes You can use %MEM to sort processes with higher memory usage. %MEM represents the RES memory percentage.\ntop - 23:38:01 up 3 days, 22:32, 2 users, load average: 1.12, 1.42, 1.09 Tasks: 198 total, 13 running, 183 sleeping, 0 stopped, 2 zombie Cpu(s): 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 1020348k total, 325848k used, 694500k free, 1352k buffers Swap: 4128760k total, 635872k used, 3492888k free, 150288k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 18537 oracle 20 0 636m 24m 21m S 0.0 2.4 0:05.41 oracle 18533 oracle 20 0 638m 24m 21m S 0.0 2.4 0:02.01 oracle ... 18509 oracle 20 0 634m 4384 4036 S 0.0 0.4 0:01.93 oracle 2639 root 20 0 729m 4052 1444 S 0.0 0.4 8:45.32 nautilus Memory-related interpretation:\nLine 4: Memory usage information: physical memory amount, used memory, free memory, buffer memory Line 5: Swap partition information: available swap total, used swap total, free swap total, kernel cached amount\nLine 6 (memory-related):\nVIRT: VSS RES: RSS (likely), anything occupying physical memory SHR: Shared Memory Size. It will include shared anonymous pages and shared file-backed pages %MEM: RSS percentage, a task\u0026rsquo;s currently resident share of available physical memory. Additionally, don\u0026rsquo;t forget to look at the process status when checking memory.\nS (example column 8) Process Status:\nD = uninterruptible sleep. Indicates the process is waiting for an external event to complete, such as disk I/O operations or network requests. Usually, D processes cannot be directly terminated. 
I = idle R = running S = sleeping T = stopped by job control signal t = stopped by debugger during trace Z = zombie The top command shows the host\u0026rsquo;s memory summary information. Process memory usage information includes RES and SHR. RES - SHR gives a rough estimate of USS, the private memory usage. Additionally, you can see process status, so top -p to view basic memory information for a specific process is very useful.\nfree # free displays the host\u0026rsquo;s swap, total and remaining memory, all parsed from /proc/meminfo.\nuser@ubuntu:~$ free total used free shared buff/cache available Mem: 8029356 794336 6297928 183384 937092 6816804 Swap: 0 0 0 total: Total usable memory (MemTotal and SwapTotal in /proc/meminfo). This includes the physical and swap memory minus a few reserved bits and kernel binary code. used: Used memory. In the output above it is total - free - buffers - cache (8029356 - 6297928 - 937092 = 794336); newer procps-ng releases calculate it as total - available. free: Unused memory (MemFree and SwapFree in /proc/meminfo). shared: Memory used (mostly) by tmpfs (Shmem in /proc/meminfo). buffers: Memory used by kernel buffers (Buffers in /proc/meminfo). cache: Memory used by the page cache and slabs (Cached and SReclaimable in /proc/meminfo). Not just pagecache, but also SReclaimable slab! buff/cache: Sum of buffers and cache. available: An estimate of how much memory is available for starting new applications without swapping (MemAvailable in /proc/meminfo). Unlike free, it also counts the page cache and reclaimable slab that can be reclaimed, so in practice available is usually larger than free. Page Cache: Page cache is primarily used as a cache for file data on the file system, especially when processes have read/write operations on files.\nBuffer Cache: Buffer cache is primarily designed for caching blocks when the system reads/writes block devices.\nps aux # The biggest advantage of ps is analyzing process status (including memory) from the process perspective. 
Processes with [ ] flags in the COMMAND are kernel processes.\n[pg@lzl ~]$ ps aux|head -1;ps aux|grep postgres USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND pg 2345 0.0 0.0 268896 236 ? Ss Jan01 0:03 /pg/pg15.3/bin/postgres -D /pg/1503data pg 2353 0.0 0.0 269040 196 ? Ss Jan01 0:00 postgres: checkpointer pg 2354 0.0 0.0 269032 160 ? Ss Jan01 0:02 postgres: background writer pg 2356 0.0 0.0 269032 116 ? Ss Jan01 0:01 postgres: walwriter pg 2357 0.0 0.0 270508 824 ? Ss Jan01 0:02 postgres: autovacuum launcher pg 2358 0.0 0.0 270492 620 ? Ss Jan01 0:00 postgres: logical replication launcher pg 29818 0.0 0.0 103372 868 pts/0 S+ 09:16 0:00 grep postgres VSZ and RSS units are KB. Memory information is limited; VSZ has little value, RSS can be referenced, but there\u0026rsquo;s no PSS or USS type information, so not much can be analyzed.\nipcs # ipcs -m is a command for querying IPC (Interprocess Communication) shared memory resources. It\u0026rsquo;s quite useful when analyzing shared memory.\n[pg@lzl ~]$ ipcs -m ------ Shared Memory Segments -------- key shmid owner perms bytes nattch status 0x0010c0b6 32769 pg 600 56 6 Shared memory key value Shared memory ID (shmid) User who created this shared memory Permissions (perms) Created size (bytes) Number of processes attached to this shared memory (nattach) Shared memory status When connecting a session to PostgreSQL, one more backend process appears:\n------ Shared Memory Segments -------- key shmid owner perms bytes nattch status 0x0010c0b6 32769 pg 600 56 7 nattch+1, indicating that the private backend process also shares a portion of the PG shared memory. At this point, the following diagram is understood more deeply:\n(http://gauss.ececs.uc.edu/Courses/c4029/code/memory/virtual.pdf)\nvmstat # vmstat is an abbreviation for Virtual Memory Statistics, and can monitor the operating system\u0026rsquo;s virtual memory, processes, and CPU activity. 
It provides statistics on the overall system situation; the shortcoming is that it cannot perform in-depth analysis of a specific process.\nUseful parameter explanations:\nvmstat [options] [delay [count]] OPTIONS: -a Display active and inactive memory -m Display slabinfo -s Display memory-related statistics and various system activity counts -t Append timestamp to each line -w Wide output mode; avoids the column misalignment that the narrow default output can suffer from -bash-4.1$ vmstat -w 1 3 procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu------- r b swpd free buff cache si so bi bo in cs us sy id wa st 0 1 661652 763348 324 76100 15 12 54 21 18 45 0 0 79 21 0 2 1 661652 763340 304 75764 0 0 0 32 12 84 0 1 0 99 0 41 1 661652 760744 244 78300 228 0 3216 0 265 442 0 0 0 100 0 pidstat # pidstat is a command from the sysstat tool, used to monitor all or specified processes\u0026rsquo; CPU, memory, threads, device IO, and other system resource usage.\nUseful parameter explanations:\npidstat OPTIONS interval [ count ] -d :Report I/O statistics -u :Report CPU utilization -r :Report page faults and memory utilization -w :Report task switching activity -p :pid[,...] -l :Display the process command name and all its arguments. View memory status of a specific process:\n-bash-4.1$ pidstat -r -l -p 2345 Linux 2.6.32-431.el6.x86_64 (lzl) 01/06/2024 _x86_64_ (1 CPU) 02:48:32 PM PID minflt/s majflt/s VSZ RSS %MEM Command 02:48:32 PM 2345 0.23 0.00 268896 240 0.02 /pg/pg15.3/bin/postgres -D /pg/1503data Most of the indicators are easy to understand. VSZ and RSS were covered earlier.\nminflt/s: Abbreviation for \u0026ldquo;minor page faults\u0026rdquo;, indicating the number of \u0026ldquo;minor page faults\u0026rdquo; that occur per second. A page fault occurs when a program accesses a virtual page that has no valid mapping in its page table. If the page is already in physical memory (for example in the page cache, or mapped by another process) and the kernel only needs to set up the mapping without any disk I/O, this is a minor page fault. 
majflt/s: Abbreviation for \u0026ldquo;major page faults\u0026rdquo;, indicating the number of \u0026ldquo;major page faults\u0026rdquo; that occur per second. Unlike a minor page fault, a major page fault occurs when the accessed page is not in physical memory and must be read in from disk (from the swap area or from the backing file), so it is far more expensive. sar # sar (System Activity Reporter) is currently one of the most comprehensive system performance analysis tools on Linux. It can report on various aspects of system activity, including: file read/write status, system call usage, disk I/O, CPU efficiency, memory usage, process activity, and IPC-related activity. The sar tool is part of the sysstat software package.\n(https://www.brendangregg.com/Perf/linux_observability_sar.png)\nsar is very powerful. The option descriptions in its man page alone run to over 1,000 lines. This article cannot possibly explain everything (being lazy).\nMemory-related parameters:\nsar OPTIONS interval [ count ] -B :Report paging statistics -r :Report memory utilization statistics -W :Report swapping statistics. 
-H :Report hugepages utilization statistics -s [ start_time ] :Set the starting time of the data -e [ end_time ] :Set the ending time of the report Example: view memory utilization with sar: sar -r 1 3\nkbmemfree: This value is basically consistent with the free value in the free command, so it does not include buffer and cache space kbmemused: This value is basically consistent with the used value in the free command, so it includes buffer and cache space %memused: This value is kbmemused as a percentage of total memory (excluding swap) kbbuffers: buffer in the free command kbcached: cache in the free command kbcommit: Memory needed for the current workload, i.e., an estimate of how much RAM plus swap is needed to guarantee the system never runs out of memory %commit: This value is kbcommit as a percentage of total memory (including swap) Example: view memory paging status with sar: sar -B 1 3\npgpgin/s: Kilobytes paged in from disk or SWAP to memory per second pgpgout/s: Kilobytes paged out from memory to disk or SWAP per second fault/s: Number of page faults per second, i.e., sum of major and minor faults majflt/s: Number of major faults per second pgfree/s: Number of pages placed on the free queue per second pgscank/s: Number of pages scanned by kswapd per second pgscand/s: Number of pages directly scanned per second pgsteal/s: Number of pages reclaimed from cache to meet memory needs per second %vmeff: Pages stolen (pgsteal) as a percentage of total scanned pages (pgscank + pgscand) per second Example: view swap information with sar: sar -W 1 3\nReport explanation:\npswpin/s: Number of swap pages swapped in per second pswpout/s: Number of swap pages swapped out per second Example: view historical memory information with sar: sar -B -s \u0026quot;08:00:00\u0026quot; -e \u0026quot;10:00:00\u0026quot;\n#Without -e, it shows information from the start time to now $ sar -B -s \u0026#34;08:00:00\u0026#34; 09:45:01 PM pgpgin/s pgpgout/s fault/s majflt/s pgfree/s pgscank/s pgscand/s pgsteal/s %vmeff 09:46:01 PM 414429.37 395024.08 179478.63 0.07 352922.62 12003.78 4266.52 16269.42 99.99 09:47:01 PM 
879907.08 337948.43 157970.97 0.02 402290.21 0.00 0.00 0.00 0.00 09:48:01 PM 772977.43 507343.30 150255.50 0.05 466742.08 0.00 5821.28 5821.27 100.00 Above, pgscank is the rate at which kswapd scans pages for background memory reclamation, and pgscand is the rate at which pages are scanned by direct memory reclamation.\ngcore # gcore is part of gdb and can generate a core dump file for a process.\nExample: dump a PostgreSQL backend process:\n[root@lzl ~] ps -ef|grep 8296 pg 8296 2345 0 09:41 ? 00:00:00 postgres: pg lzldb [local] idle [root@lzl ~] cat /proc/8296/smaps |grep Pss |awk \u0026#39;{sum+=$2 };END {print sum/1024}\u0026#39; 0.351562 [root@lzl ~] cat /proc/8296/smaps |grep Rss |awk \u0026#39;{sum+=$2 };END {print sum/1024}\u0026#39; 0.445312 [root@lzl ~] cat /proc/8296/smaps|sed \u0026#39;/zero/,/VmFlags/d\u0026#39; |grep Private |awk \u0026#39;{sum+=$2 };END {print sum/1024}\u0026#39; 0.0078125 The smaps values are in kB, so dividing by 1024 gives MB: process 8296\u0026rsquo;s USS is only about 8 KB (0.0078 MB) and its RSS about 456 KB (0.445 MB). Dump memory:\ngcore -o /tmp/dump 8296 Dumping takes some time, produces a relatively large file, and suspends the process while it runs.\n[root@lzl 8296]# ls -lh /tmp/dump.8296 -rw-r--r-- 1 root root 252M Jan 7 10:59 /tmp/dump.8296 gdb # gdb can view specific locations and content in memory.\nExample: view PostgreSQL backend cached data:\nOpen a new session to query a partitioned table, keeping the session open: [pg@lzl ~]$ psql psql (15.3) Type \u0026#34;help\u0026#34; for help. postgres=\u0026gt; \\c lzldb You are now connected to database \u0026#34;lzldb\u0026#34; as user \u0026#34;pg\u0026#34;. lzldb=\u0026gt; select * from lzlpartition limit 1; appl_no | is_deleted | date_created | date_updated ---------+------------+--------------+-------------- (0 rows) Use pmap, smaps to view process memory usage and find the memory segment to dump: [root@lzl 13393]# pmap -x 13393 13393: postgres: pg lzldb [local] idle Address Kbytes RSS Dirty Mode Mapping 0000000000400000 7864 1204 0 r-x-- postgres .. 
00007fbe2ae1b000 145968 2164 176 rw-s- zero (deleted) ---RSS takes the most here 00007fbe33ca7000 96836 0 0 r---- locale-archive 00007fbe39b38000 20 0 0 rw--- [ anon ] 00007fbe39b46000 28 0 0 rw-s- PostgreSQL.3661351388 00007fbe39b4d000 4 0 0 rw-s- [ shmid=0x8001 ] 00007fbe39b4e000 4 0 0 rw--- [ anon ] 00007fffe3933000 84 36 0 rw--- [ stack ] 00007fffe397d000 4 4 0 r-x-- [ anon ] ffffffffff600000 4 0 0 r-x-- [ anon ] [root@lzl 13393]# cat /proc/13393/smaps |grep -A 13 zero 7fbe2ae1b000-7fbe33ca7000 rw-s 00000000 00:04 12556 /dev/zero (deleted) Size: 145968 kB Rss: 2164 kB Pss: 2164 kB Shared_Clean: 0 kB Shared_Dirty: 0 kB Private_Clean: 1988 kB Private_Dirty: 176 kB Referenced: 2164 kB Anonymous: 0 kB AnonHugePages: 0 kB Swap: 0 kB KernelPageSize: 4 kB MMUPageSize: 4 kB gdb dump memory: The starting position for dumping memory is the vm address in smaps + 0x:\n[pg@lzl tmp]$ gdb (gdb) attach 13393 (gdb) dump memory /tmp/delete.dump 0x7fbe2ae1b000 0x7fbe33ca7000 View the dump file: You can simply view it through strings:\n[root@lzl 13393]# strings /tmp/delete.dump|grep lzl|sort|uniq ... @lzlpartition_202301 lzlpartition_202301 lzlpartition_202301_appl_no_idx lzlpartition_202301_date_created_idx ... 
lzlpartition_202306 lzlpartition_202306_appl_no_idx lzlpartition_202306_date_created_idx @lzlpartition_attach lzlpartition_attach @nk_lzlpartition nk_lzlpartition select * from lzlpartition limit 1; As long as the session queries a partitioned table, all partition and index metadata is cached in the backend process.\nNote:\ngdb attach [pid] will hang the process; do not execute casually The dump file size equals VSS, generally much larger than RSS/PSS/USS Memory Summary # References # Easily Break Through File I/O Bottlenecks: Memory-Mapped mmap Technology https://blog.51cto.com/u_15481245/6582927\nStep by Step with Diagrams: Deep Understanding of Linux Physical Memory Management https://cloud.tencent.com/developer/article/2352771?areaId=106001\nSystematically Learning Memory Management from a DBA\u0026rsquo;s Perspective https://mp.weixin.qq.com/s/CybzGP44dVWQN5hfFrVx7A\nhttps://linux2me.wordpress.com/2017/09/15/linux-introduction-to-memory-management/\nMemory management in Linux https://www.slideshare.net/raghusiddarth/memory-management-in-linux-11551521?from_search=2\nLinux Performance Tunning Memory https://www.slideshare.net/shayc1/linux-performance-tunning-memory?from_search=4\nHow to Learn the Linux Kernel (Memory Chapter) https://mp.weixin.qq.com/s/lKKHH1MMiZbnIbDQt3-IAQ\nhttps://courses.engr.illinois.edu/cs241/sp2014/lecture/09-VirtualMemory_II_sol.pdf\nLinux Process Virtual Address Space https://maodanp.github.io/2019/06/02/linux-virtual-space/\nRed Hat Official Documentation https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/virtualization_tuning_and_optimization_guide/chap-virtualization_tuning_optimization_guide-numa\nData Processing on Modern Hardware https://db.in.tum.de/teaching/ss21/dataprocessingonmodernhardware/MH_8.pdf?lang=de\nChapter 2 Describing Physical Memory https://www.kernel.org/doc/gorman/html/understand/understand005.html\nVarious command man pages\nLinux Forced Memory Reclamation, Linux Memory Source Code 
Analysis - Memory Reclamation (Overall Process) https://blog.csdn.net/weixin_35094083/article/details/116688112\nMemory compaction https://lwn.net/Articles/368869/\nMemory Journey: How to Improve CMA Utilization? https://ost.51cto.com/posts/10815\nThe implementations of anti pages fragmentation in Linux kernel https://teawater.github.io/presentation/antif.pdf\nThe /proc Filesystem https://www.kernel.org/doc/Documentation/filesystems/proc.txt\nThe /proc/meminfo File in Linux https://www.baeldung.com/linux/proc-meminfo\nthe proc filesystem https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/deployment_guide/s2-proc-meminfo\nIntroduction and Usage of Linux /proc/{pid}/maps (Locating Memory Leaks) https://blog.csdn.net/mijichui2153/article/details/123934531\nCPU and Memory Usage in Linux top Command https://blog.csdn.net/weixin_45030965/article/details/127693042\nsmem memory reporting tool https://www.selenic.com/smem/\nLinux performance optimization https://feiyang233.club/post/linux/\ngdb onlinedocs https://sourceware.org/gdb/current/onlinedocs/gdb\nLinux_Core_Dumps https://averageradical.github.io/Linux_Core_Dumps.pdf\n","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/2024/08/12/a-brief-analysis-of-linux-memory/","section":"Posts","summary":"Basic Memory Concepts # Operating system memory is very important and fairly complex. Many knowledge points need to be mastered to further analyze program issues. Since this is the first comprehensive and systematic exposure to OS memory, the goal is to understand Linux memory concepts thoroughly and at a low level without diving deep into principles, so this chapter will also try to avoid Linux source code knowledge.\n","title":"A Brief Analysis of Linux Memory","type":"posts"},{"content":" FDW Basic Concepts # What is SQL/MED? # SQL/MED aims to unify access methods for heterogeneous data sources. 
In 2003, SQL/MED was added to the ISO/IEC 9075-9 standard, defined as a SQL standard extension for managing external data via foreign-data wrappers (FDW) or datalink (such as Oracle or PG\u0026rsquo;s dblink). In short, SQL/MED is an international SQL extension standard. Many databases already support SQL/MED, such as DB2, MariaDB, PG, and more.\nWithout SQL/MED, applications must access required data sources themselves and process data at the application layer:\nWith SQL/MED, the data access architecture becomes clearer:\nHowever, while this architecture diagram appears simpler, it increases the database\u0026rsquo;s IO and computation pressure. This goes against the modern trend of decoupling computation from the database to the application layer.\nOf course, both approaches have their pros and cons, and SQL/MED is still used in certain scenarios.\nSQL/MED exists as a standard, and PostgreSQL supports the SQL/MED standard excellently through FDW.\nWhat is FDW? # PostgreSQL has supported FDW since version 9.1. Users can access external data (foreign data) through regular SQL statements. Foreign data is accessed via a foreign data wrapper (FDW). The FDW in PostgreSQL is itself a library — because different external data sources correspond to different FDW extensions, we often call it an FDW plugin.\nPG\u0026rsquo;s FDW functionality is extremely powerful: it not only supports multiple data sources but also optimizes data access, and can even be used for \u0026ldquo;beyond expectations\u0026rdquo; purposes, such as implementing cluster functionality.\nInstallation and Download # Basically every type of database and data format has its own FDW plugin: oracle_fdw for Oracle databases, mysql_fdw for MySQL databases, and so on. FDW plugins can be installed directly or downloaded:\nFDWs already included as extensions: file_fdw, postgres_fdw, cstore_fdw Other FDW plugins can be downloaded from PGXN or the wiki, such as: oracle_fdw, mysql_fdw, json_fdw. 
Be sure to read the README carefully to understand each FDW\u0026rsquo;s limitations and usage rules. FDW plugin download: https://pgxn.org/tag/fdw/ More FDWs (mostly beta): https://wiki.postgresql.org/wiki/Foreign_data_wrappers Write your own FDW: https://www.postgresql.org/docs/current/fdwhandler.html Advantages of FDW over dblink in PG # PG also has dblink. FDW and dblink are functionally similar — both access external tables. But FDW has more advantages:\nFDW supports many more data sources (a LOT more). dblink only supports PostgreSQL databases, equivalent to just one FDW plugin — postgres_fdw (which is actually much more powerful). Transparent to developers. External tables can be accessed just like regular tables. More compliant with standard SQL syntax. Better performance in many scenarios. The functionality provided by this module overlaps substantially with the functionality of the older dblink module. But postgres_fdw provides more transparent and standards-compliant syntax for accessing remote tables, and can give better performance in many cases.\nIn summary, FDW is stronger than the dblink plugin — you can basically forget about dblink.\nFDW\u0026rsquo;s Four Objects # Different FDWs have different usage patterns, but generally all require creating 4 objects: foreign data wrapper, server, user mapping, foreign table. 
Some objects are not mandatory — for example, file_fdw doesn\u0026rsquo;t need a user mapping, while relational database FDWs generally require one.\nforeign data wrapper # After creating the corresponding FDW extension with CREATE EXTENSION, the foreign data wrapper is automatically created.\nFor example, creating a file_fdw extension:\n=# create extension file_fdw; CREATE EXTENSION =# \\dx Name | Version | Schema | Description --------------------+---------+------------+------------------------------------------------------------------------ file_fdw | 1.0 | public | foreign-data wrapper for flat file access ## select * from information_schema.foreign_data_wrappers; foreign_data_wrapper_catalog | foreign_data_wrapper_name | authorization_identifier | library_name | foreign_data_wrapper_language ------------------------------+---------------------------+--------------------------+--------------+------------------------------- postgres | file_fdw | postgres | [null] | c You can also create a foreign data wrapper manually without using an extension. See CREATE FOREIGN DATA WRAPPER.\nserver # CREATE SERVER creates an external service, essentially specifying the data source. The OPTIONS syntax varies by foreign-data wrapper — for example, the OPTION syntax for file_fdw and postgres_fdw is definitely different. At this point, you need to read the FDW plugin\u0026rsquo;s README or official documentation. 
For example:\nCreate a file_fdw external service named fileserver:\nCREATE SERVER fileserver FOREIGN DATA WRAPPER file_fdw; Create a postgres_fdw external service named pgserver, pointing to the lzldb database on a PG instance at 172.0.0.1:5432:\nCREATE SERVER pgserver FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host \u0026#39;172.0.0.1\u0026#39;, dbname \u0026#39;lzldb\u0026#39;, port \u0026#39;5432\u0026#39;); View servers:\n=# select * from information_schema.foreign_servers; foreign_server_catalog | foreign_server_name | foreign_data_wrapper_catalog | foreign_data_wrapper_name | foreign_server_type | foreign_server_version | authorization_identifier ------------------------+---------------------+------------------------------+---------------------------+---------------------+------------------------+-------------------------- postgres | pgserver | postgres | postgres_fdw | [null] | [null] | postgres postgres | fileserver | postgres | file_fdw | [null] | [null] | postgres user mapping # User mapping defines the correspondence between external service users and local users. Therefore, relational database FDWs generally have user mappings, while file-type FDWs without user definitions don\u0026rsquo;t need them.\nFor example, create a user mapping using the pgserver from above:\nCREATE USER MAPPING FOR localuser SERVER pgserver OPTIONS (user \u0026#39;remoteuser\u0026#39;, password \u0026#39;mypasswd\u0026#39;); View user mappings:\n=# select * from information_schema.user_mappings; authorization_identifier | foreign_server_catalog | foreign_server_name --------------------------+------------------------+--------------------- localuser | lzldb | pgserver foreign table # Foreign tables map remote tables locally, allowing them to be accessed like regular tables. Since local objects are involved and there are many OPTIONS, the full syntax is somewhat complex. See CREATE FOREIGN TABLE. 
Simply put, you create a local table definition that maps to a remote table.\nTwo common ways to create foreign tables: creation and import.\nCreate a foreign table:\nCREATE FOREIGN TABLE localtable ( id char(5) NOT NULL, name varchar(40) NOT NULL ) SERVER pgserver OPTIONS (table_name \u0026#39;remotetable\u0026#39;); Creating foreign tables one by one is tedious, so you can import all tables from a remote schema at once:\nIMPORT FOREIGN SCHEMA remoteschema FROM SERVER pgserver INTO localschema; View foreign tables:\ninformation_schema.foreign_tables; -- Intuitive view of foreign tables pg_foreign_table; -- Less intuitive, but shows OPTION settings Using FDW # Viewing Foreign Table Information # psql\u0026rsquo;s built-in shortcuts are quite clear for viewing the 4 objects of foreign tables, but pay attention to search_path settings:\npsql command Meaning \\des list foreign servers \\deu list user mappings \\det list foreign tables \\dtE list both local and foreign tables Foreign table object views/tables can be messy, so here\u0026rsquo;s a quick organization:\nforeign data wrapper tables/views Meaning information_schema._pg_foreign_data_wrappers More complete information information_schema.foreign_data_wrappers Less information information_schema.foreign_data_wrapper_options Targeted query of foreign data wrapper options pg_foreign_data_wrapper Slightly less info, but has permission info that other views lack foreign server tables/views Meaning information_schema._pg_foreign_servers More complete information information_schema.foreign_servers Less information information_schema.foreign_server_options Targeted option query: one record per option, not per server pg_foreign_server Less information, base table user mapping tables/views Meaning information_schema._pg_user_mappings Fairly complete user mapping information information_schema.user_mappings Less information information_schema.user_mapping_options Targeted query of UM options pg_user_mappings Slightly less than 
_pg_user_mappings. Viewable by unprivileged users, with passwords shown as null pg_user_mapping Less information, base table, mainly options. Inaccessible to unprivileged users foreign table tables/views Meaning information_schema._pg_foreign_tables More complete, shows all foreign tables information_schema._pg_foreign_table_columns Shows column-to-column mappings information_schema.foreign_table_options Targeted display of foreign table options pg_foreign_table Less information, base table These views/tables look messy but actually have a clear structure. The 4 object types all follow the same data dictionary pattern:\npg_xxx are base tables, the foundational information source for the 4 objects information_schema._pg_xxx joins pg_xxx base tables with other info; it\u0026rsquo;s a summary view with comprehensive information information_schema.xxx is a view on information_schema._pg_xxx, with less information information_schema.xxx_options provides targeted option information, sourced only from the full view information_schema._pg_xxx A special view: pg_user_mappings, usable even by unprivileged users Permission Considerations # If you use the postgres superuser throughout to create foreign tables, you\u0026rsquo;ll rarely encounter issues. But in production, application users are typically not superusers. Permissions are therefore both extremely important and quite troublesome. Using a regular user for testing is crucial (as with any testing). PG\u0026rsquo;s permission system is like a boss battle: if any link is missing, access fails.\nKey permission points:\nForeign data wrapper, server, and user mapping owners are their creators. Users must be granted USAGE privilege or be the owner themselves to use them. Accessing remote data sources requires a remote user with appropriate permissions, specified in the user mapping step with suitable remote login credentials. 
After creating/importing foreign tables locally, these objects are treated as local objects (only the data dictionary is local). So PG\u0026rsquo;s local object access permission system must also be properly configured. FDW Usage Examples # There are hundreds of FDW implementations for various data sources worldwide — relational databases, NoSQL databases, various file types, Web Services, columnar storage, big data, and more. Here are a few common FDWs.\nUsing postgres_fdw # This is probably the most commonly used and most powerful FDW. It allows accessing external PostgreSQL databases from a local database. It can also be used for self-access — this is important because: PostgreSQL cannot access across databases internally! To solve this problem, a good approach is using FDW for cross-database access within the same instance — accessing yourself through an external connection.\nHere\u0026rsquo;s an example of cross-database access using postgres_fdw:\nAn instance has two databases: aka and bkb. You can\u0026rsquo;t query both databases in a single SQL statement — databases in PG are logically isolated, somewhat like Oracle 12c PDBs.\n[lzl@postgres]=# \\l aka | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =Tc/postgres + | | | | | postgres=CTc/postgres bkb | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =Tc/postgres + | | | | | postgres=CTc/postgres Although both databases are local, when using FDW we still need the local/remote database concept. Here we treat aka as the local database and bkb as the remote database, enabling access to bkb\u0026rsquo;s tables from aka while handling permission issues.\n1. Install FDW plugin\n\\c aka create extension postgres_fdw; Note: Extensions are database-level — switch to the local database first.\n2. Grant user permissions\ngrant usage on foreign data wrapper postgres_fdw to akadata; 3. 
Create server\n\\c aka akadata CREATE SERVER bkb_server FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host \u0026#39;127.0.0.1\u0026#39;, port \u0026#39;5432\u0026#39;, dbname \u0026#39;bkb\u0026#39;); 4. Create user mapping\nCREATE USER MAPPING FOR akadata SERVER bkb_server OPTIONS (user \u0026#39;bkbdata\u0026#39;, password \u0026#39;bkbpasswd\u0026#39;); 5. Create schema in aka database, grant to akadata user\n\\c aka postgres create schema bkb; grant usage on schema bkb TO akadata; --GRANT select ON ALL TABLES IN SCHEMA bkb TO akadata; grant all privileges on schema bkb TO akadata; 6. Import bkb tables\n\\c aka akadata Import entire schema:\nIMPORT FOREIGN SCHEMA public FROM SERVER bkb_server INTO bkb; Import a single table:\nIMPORT FOREIGN SCHEMA public LIMIT TO (tab1) FROM SERVER bkb_server INTO bkb 7. View foreign tables\n=# select * from information_schema.foreign_tables; foreign_table_catalog | foreign_table_schema | foreign_table_name | foreign_server_catalog | foreign_server_name -----------------------+----------------------+-------------------------------------+------------------------+--------------------- aka | bkb | tab1 | aka | bkb_server Using file_fdw # The file_fdw extension provides PG with read-only access to external files. file_fdw is already in contrib and can be installed with CREATE EXTENSION. External files must conform to COPY rules.\nHere\u0026rsquo;s a classic example of mapping PG output logs to a foreign table, script from the official documentation:\n1. Create file_fdw extension\nCREATE EXTENSION file_fdw; 2. Create external server\nCREATE SERVER fileserver FOREIGN DATA WRAPPER file_fdw; 3. 
Create foreign table\nCREATE FOREIGN TABLE pglog ( log_time timestamp(3) with time zone, user_name text, database_name text, process_id integer, connection_from text, session_id text, session_line_num bigint, command_tag text, session_start_time timestamp with time zone, virtual_transaction_id text, transaction_id bigint, error_severity text, sql_state_code text, message text, detail text, hint text, internal_query text, internal_query_pos integer, context text, query text, query_pos integer, location text, application_name text ) SERVER fileserver OPTIONS ( filename \u0026#39;pg_log/postgresql-07-06.csv\u0026#39;, format \u0026#39;csv\u0026#39; ); 4. Query the log table\n=# select user_name,database_name,process_id,error_severity,message from pglog where error_severity\u0026lt;\u0026gt;\u0026#39;LOG\u0026#39;; user_name | database_name | process_id | error_severity | message -----------+---------------+------------+----------------+----------------------------------------------- appuser1 | db1 | 102349 | ERROR | value too long for type character varying(20) appuser1 | db1 | 55378 | ERROR | value too long for type character varying(20) appuser2 | db2 | 219377 | ERROR | relation \u0026#34;dual\u0026#34; does not exist Deep Dive into postgres_fdw # postgres_fdw Performance Optimization # Unlike most FDW plugins, postgres_fdw is an official plugin maintained by the PostgreSQL Global Development Group, with its source code in contrib. Because external services differ in functionality and structure, some features — such as obtaining remote database access costs or aggregate pushdown in certain scenarios — are difficult to implement in other FDWs. But in postgres_fdw they\u0026rsquo;re achievable. The official team has done extensive optimization for postgres_fdw, making it extremely powerful.\nSQL Execution Process # The parser generates a query tree from the foreign table definition. The planner connects to the foreign server. Obtain cost information. 
If use_remote_estimate is true, the planner executes EXPLAIN on the remote database to get access costs; if false (the default), it estimates them locally instead. Deparse generates remote SQL text. FDW accesses remote database objects by sending SQL text, so the planner generates SQL text for remote execution. The Remote SQL part of the execution plan directly shows the deparsed SQL: =\u0026gt; explain (verbose) select a from bkb.tab1 where a=1; QUERY PLAN ----------------------------------------------------------------- Foreign Scan on bkb.tab1 (cost=100.00..146.86 rows=15 width=4) Output: a Remote SQL: SELECT a FROM public.tab1 WHERE ((a = 1)) Send SQL statement and receive data. The remote database executes the SQL independently and returns results to the local database in batches of fetch_size rows (default 100). Cost Estimation # postgres_fdw can pass remote database object access costs to the local database for calculating the overall SQL execution plan cost. However, simply returning the remote estimated cost isn\u0026rsquo;t enough; the cost of remote access itself must also be considered. postgres_fdw provides 3 OPTIONS to adjust foreign table cost estimation:\nuse_remote_estimate: When set to true, the planner runs EXPLAIN on the remote database to get estimated costs, adding fdw_startup_cost and fdw_tuple_cost. When false (default), the planner calculates locally and adds fdw_startup_cost and fdw_tuple_cost. Local foreign table statistics may differ from actual values.\nfdw_startup_cost: Startup cost for foreign tables, default 100. Represents the cost of establishing a connection, parsing, and generating a plan on the external service.\nfdw_tuple_cost: Additional cost per tuple scanned from a foreign table, default 0.01. Represents data transfer cost, so higher network latency calls for a higher setting.\nAggregate Pushdown # Aggregate pushdown executes computations on the remote database, with the local database directly receiving the remote execution results. 
Without aggregate pushdown, all data must be returned to the local database for computation, increasing data transfer\u0026rsquo;s impact on SQL execution efficiency and the local database\u0026rsquo;s computational burden.\n(In this environment, bkb. are all foreign tables, local tables are public.)\nPredicate Pushdown: postgres_fdw supports WHERE pushdown — no need to return all data to the local database.\n=\u0026gt; explain (verbose,costs off) select f1.a from bkb.tab1 f1 where f1.a=1; QUERY PLAN --------------------------------------------------------- Foreign Scan on bkb.tab1 f1 Output: a Remote SQL: SELECT a FROM public.tab1 WHERE ((a = 1)) Sort Pushdown: postgres_fdw supports sort pushdown, sending sorts to the remote database.\n=\u0026gt; explain (verbose,costs off) select f1.a from bkb.tab1 f1 order by 1 desc nulls first; QUERY PLAN --------------------------------------------------------------------- Foreign Scan on bkb.tab1 f1 Output: a Remote SQL: SELECT a FROM public.tab1 ORDER BY a DESC NULLS FIRST Join Pushdown: Some joins cannot be pushed down, like local table JOIN foreign table — only the foreign table results can be brought locally for joining.\n=\u0026gt; explain (verbose,costs off) select f1.a,l2.a from bkb.tab1 f1,tab1 l2 where f1.a=l2.a; QUERY PLAN ----------------------------------------------------- Hash Join Output: f1.a, l2.a Hash Cond: (l2.a = f1.a) -\u0026gt; Seq Scan on public.tab1 l2 Output: l2.a, l2.b -\u0026gt; Hash Output: f1.a -\u0026gt; Foreign Scan on bkb.tab1 f1 Output: f1.a Remote SQL: SELECT a FROM public.tab1 When both tables are foreign tables, joins can be pushed down to the remote database:\n=\u0026gt; explain (verbose,costs off) select f1.a,f1.b from bkb.tab1 f1 left join bkb.tab2 f2 on f1.a=f2.a; QUERY PLAN ----------------------------------------------------------------------------------------------------- Foreign Scan Output: f1.a, f1.b Relations: (bkb.tab1 f1) LEFT JOIN (bkb.tab2 f2) Remote SQL: SELECT r1.a, r1.b 
FROM (public.tab1 r1 LEFT JOIN public.tab2 r2 ON (((r1.a = r2.a)))) Aggregate Function Pushdown: Supports pushing down aggregate functions — functions must be IMMUTABLE.\n=\u0026gt; explain (verbose,costs off) select b,count(*),avg(a) from bkb.tab1 group by b; QUERY PLAN ---------------------------------------------------------------------------- GroupAggregate Output: b, count(*), avg(a) Group Key: tab1.b -\u0026gt; Foreign Scan on bkb.tab1 Output: a, b Remote SQL: SELECT a, b FROM public.tab1 ORDER BY b ASC NULLS LAST Some scenarios aren\u0026rsquo;t supported, such as HAVING clauses that can only filter locally:\n=\u0026gt; explain (verbose,costs off) select b,count(*) from bkb.tab1 group by b having count(*)\u0026gt;=2; QUERY PLAN ------------------------------------------------------------------------- GroupAggregate Output: b, count(*) Group Key: tab1.b Filter: (count(*) \u0026gt;= 2) -\u0026gt; Foreign Scan on bkb.tab1 Output: a, b Remote SQL: SELECT b FROM public.tab1 ORDER BY b ASC NULLS LAST Other Features # Remote Execution OPTION Settings # extensions: User-specified FDW extensions that can use \u0026ldquo;remote computation\u0026rdquo;. Can only be set at the server level.\nfetch_size: Number of rows fetched per batch from the remote database, default 100. Can be set at server or table level.\nupdatable: By default, postgres_fdw foreign tables are updatable. The updatable option can control this. If a foreign table is inherently non-updatable, setting updatable to false at the table level causes errors directly locally.\ntruncatable: Starting from PG14, postgres_fdw supports truncating foreign tables, controlled by the truncatable option, defaulting to true.\nConnection Management # On the first foreign table access in a session, a connection to the remote database is established. As long as the local session hasn\u0026rsquo;t disconnected, this connection is reused. 
If multiple user mappings are used, a connection is established for each user mapping.\nStarting from PG14, the keep_connections option controls this behavior. Defaults to on, meaning the session can reuse this connection later; when off, the connection is closed at transaction end.\nPG14+: postgres_fdw_get_connections() can view connection status.\nTransaction Management # Important FDW transaction characteristics:\nThe remote database executes SQL based on the text sent by the local database. When the local database has SERIALIZABLE isolation level, the remote also uses SERIALIZABLE; otherwise, the remote uses REPEATABLE READ. When the local transaction commits or rolls back, the remote transaction also commits or rolls back. FDW does not support 2PC transactions. Without distributed 2PC transaction support, partial commits may occur. For example, even if a remote update fails, the local update can still complete:\n=\u0026gt; select * from tab1; a | b ---+----- 1 | abc =\u0026gt; begin; BEGIN =\u0026gt; update tab1 set b=\u0026#39;123\u0026#39; ; UPDATE 6 =\u0026gt; update bkb.tab1 set b=\u0026#39;a\u0026#39; where c=1; ERROR: 42703: column \u0026#34;c\u0026#34; does not exist LINE 1: update bkb.tab1 set b=\u0026#39;a\u0026#39; where c=1; =\u0026gt; commit; COMMIT =\u0026gt; select * from tab1; a | b ---+----- 1 | 123 No Distributed Lock Management # FDW has no distributed lock management, hence no distributed deadlock detection mechanism.\nDeadlock detection works for local tables but not for foreign tables.\nAsynchronous Execution # Starting from PG14, postgres_fdw supports asynchronous execution. When there are multiple Append nodes in the execution plan, they can execute in parallel, improving performance when accessing multiple foreign tables.\nAsynchronous execution only occurs with multiple sessions — i.e., multiple user mappings. The async_capable option controls this, defaulting to false. 
The enable_async_append parameter must also be enabled (default on).\nParallel Commit # Starting from PG15, postgres_fdw supports parallel commit. Remote transactions commit alongside local transactions. Without parallel commit/rollback, PG can only commit/rollback remote transactions serially.\npostgres_fdw Version History # Version Release Support Notes 9.3 postgres_fdw released 9.6 Support pushdown of join, sort, update, delete; fetch_size support 10 Push down aggregate functions to remote server; more join pushdown scenarios 11 Push down operators to partitioned tables; UPDATE/DELETE joins can push down 12 More order by/limit pushdown scenarios 13 Enhanced password authentication; pg_dump can export foreign tables 14 Parallel scanning for queries with multiple foreign tables (async_capable); bulk insert; postgres_fdw_get_connections(); TRUNCATE foreign tables 15 Push down CASE expressions; parallel commit (parallel_commit) 16 Interruptible parallel transactions; foreign table analyze_sampling; COPY batch_size; foreign table truncate triggers Sharding Implementation # FDW-based Sharding # Many PostgreSQL forks (XC/XL, Citus, etc.) have implemented sharding, but PostgreSQL itself is a single-instance database without native sharding support. Since SQL/MED was defined for accessing external data, postgres_fdw can implement sharding by accessing external instances.\nCore Sharding Features # Key features needed for usable sharding:\nPartition management — SQL/MED transparency allows sharding on partitioned tables. Partition optimization — partition pruning, PARTITION WISE JOIN, etc. Aggregate pushdown — push computation to shard nodes. Parallel scanning — PG14 implemented. 2PC transactions — FDW doesn\u0026rsquo;t yet support this. Shard management — foreign table partitions must be manually created and added. Global transactions — global clocks, global snapshot management needed. Distributed locks — stronger distributed lock mechanisms needed. 
Batch writes — DML/COPY distribution to shards needs batch write support. Summary # PostgreSQL\u0026rsquo;s FDW functionality derives from the SQL/MED standard for accessing external data, supporting many data source types. FDW has 4 basic objects: foreign data wrapper, server, user mapping, foreign table. postgres_fdw has many feature enhancements and performance optimizations, capable of pushing operators down to remote databases. Sharding can be implemented based on postgres_fdw, though some features still need improvement. References # https://www.interdb.jp/pg/pgsql04.html https://www.postgresql.org/docs/13/postgres-fdw.html https://www.postgresql.org/docs/current/file-fdw.html https://wiki.postgresql.org/wiki/WIP_PostgreSQL_Sharding https://www.percona.com/blog/postgres_fdw-enhancement-in-postgresql-14/ https://www.percona.com/blog/foreign-data-wrappers-postgresql-postgres_fdw/ https://www.percona.com/blog/parallel-commits-for-transactions-using-postgres_fdw-on-postgresql-15/ https://www.enterprisedb.com/blog/postgresql-aggregate-push-down-postgresfdw https://www.postgresql.fastware.com/postgresql-insider-fdw-ove https://momjian.us/main/writings/pgsql/sharding.pdf https://www.slideserve.com/johnna/sql-med-and-more-powerpoint-ppt-presentation https://dbaplus.cn/news-19-2090-1.html https://www.highgo.ca/2019/08/08/horizontal-scalability-with-sharding-in-postgresql-where-it-is-going-part-3-of-3/ https://www.highgo.ca/2021/06/28/parallel-execution-of-postgres_fdw-scans-in-pg-14-important-step-forward-for-horizontal-scaling/\n","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/2024/08/12/a-brief-analysis-of-postgresql-fdw/","section":"Posts","summary":"FDW Basic Concepts # What is SQL/MED? # SQL/MED aims to unify access methods for heterogeneous data sources. In 2003, SQL/MED was added to the ISO/IEC 9075-9 standard, defined as a SQL standard extension for managing external data via foreign-data wrappers (FDW) or datalink (such as Oracle or PG’s dblink). 
In short, SQL/MED is an international SQL extension standard. Many databases already support SQL/MED, such as DB2, MariaDB, PG, and more.\n","title":"A Brief Analysis of PostgreSQL FDW","type":"posts"},{"content":" Architecture # (https://www.postgresql.fastware.com/blog/lets-get-back-to-basics-postgresql-memory-components)\n(http://geekdaxue.co/read/fcant@sql/qts5is)\nShared Memory # Linux Shared Memory Implementation # (https://momjian.us/main/writings/pgsql/inside_shmem.pdf)\nShared Memory on Linux Shared memory is an IPC (Inter-Process Communication) mechanism supported by Unix-based operating systems (including Linux). It is a type of memory that multiple processes can simultaneously use to communicate with each other. Shared memory is one of the fastest IPC mechanisms because it does not require processes to copy data between each other. Processes can access shared memory through their own address space.\nTwo Forms of Shared Memory One form of shared memory is memory-mapped files. Once multiple processes map the same file into their address space, they can access the file\u0026rsquo;s contents and simultaneously update the file directly using the mapped memory. Another form of shared memory is anonymous memory. This refers to shared memory regions allocated by programs without associating them with a file or persistent storage mechanism.\nmmap() Mapping a file into a process\u0026rsquo;s address space uses mmap(). Anonymous memory can also be created with mmap(). mmap is part of the standard C library. 
For anonymous memory, the flags should include MAP_ANONYMOUS (or its older synonym MAP_ANON), in which case fd should be -1 and offset should be 0.\nhttp://www.tutorialsdaddy.com/courses/linux-device-driver/lessons/mmap/\nShared Memory in PostgreSQL # https://www.interdb.jp/pg/pgsql02.html\nPostgreSQL has many types of shared memory: shared buffers, WAL buffer, CLOG buffer, lock space, etc.\nShared Buffer The shared memory area where PostgreSQL caches data, similar to Oracle\u0026rsquo;s SGA. When data hits the shared buffer, it is read directly from memory without requiring disk I/O. PostgreSQL loads table pages and indexes from persistent storage into this area and operates on them directly.\nWAL Buffer To ensure no data is lost in the event of a server failure, PostgreSQL supports the WAL mechanism. WAL data (also called XLOG records) is PostgreSQL\u0026rsquo;s transaction log. The WAL BUFFER is the buffer for WAL data before it is written to persistent storage.\nCLOG BUFFER The Commit Log (CLOG) maintains the status of all transactions (e.g., in_progress, committed, aborted) for the concurrency control mechanism. The corresponding CLOG BUFFER is the buffer for CLOG data before it is written to disk.\nPostgreSQL Shared Memory Parameters # shared_buffers Default 128MB. Recommended to configure at 25% of total memory. Because PostgreSQL\u0026rsquo;s private memory generally takes up a significant portion and PostgreSQL also relies on the OS page cache, sufficient memory must be left for the OS. It is therefore not recommended to set this to as high a value (relative to total memory) as you would for Oracle\u0026rsquo;s SGA.\nshared_memory_type Specifies the shared memory implementation method, not only for shared_buffers but also for other shared data areas. The shared memory implementation varies by platform. On Linux the default is mmap. 
Other values are:\nmmap (for anonymous shared memory allocated using mmap) sysv (for System V shared memory allocated via shmget) windows (for Windows shared memory) Note that posix (POSIX shared memory allocated using shm_open) and the file-backed mmap variant (memory-mapped files stored in the data directory) are values of dynamic_shared_memory_type, not of shared_memory_type. By default, PostgreSQL uses a very small amount of System V shared memory, with the vast majority being mmap shared memory. Due to differences between POSIX and System V IPC, semaphore implementations differ. The shared_memory_type parameter can be explicitly adjusted for the IPC implementation mechanism:\nSetting System V IPC (default is mmap): On Linux and FreeBSD systems, the default shared memory system settings are generally sufficient unless shared_memory_type is set to sysv; System V semaphores are not used on these two platforms. On OpenBSD systems, if shared_memory_type is set to sysv, the default shared memory system parameters are insufficient and need to be adjusted via sysctl.\nSetting POSIX IPC: POSIX semaphores are effective on Linux and FreeBSD.\ndynamic_shared_memory_type The mechanism for dynamic shared memory, defaults to posix. This parameter is important for parallel queries. A community email about /dev/shm describes:\nPostgreSQL creates segments in /dev/shm for parallel queries (via\nshm_open()), not for shared buffers. The amount used is controlled by\nwork_mem. Queries can use up to work_mem for each node you see in the\nEXPLAIN plan, and for each process, so it can be quite a lot if you\nhave lots of parallel worker processes and/or lots of\ntables/partitions being sorted or hashed in your query.\nTranslation:\nParallel queries use POSIX shared memory and create segments in /dev/shm These segments are NOT part of shared_buffers Each plan node in a query, in each process, can use up to work_mem! min_dynamic_shared_memory The initial size of memory used by parallel queries, allocated at server startup. 
Related to huge_pages and dynamic_shared_memory_type.\nhuge_pages This parameter controls whether the main shared memory area uses huge pages. This means private memory and OS-level memory are not affected by this setting. PostgreSQL\u0026rsquo;s use of huge pages is currently only supported on Linux and Windows systems; on Linux systems, it is only supported when shared_memory_type is set to mmap!\nSetting Description try default, attempts to allocate huge pages on uses huge pages; server will not start if allocation fails off does not use huge pages huge_page_size Controls the size of huge pages. Default is 0, meaning PostgreSQL uses the huge page size provided by the operating system. Setting a non-default value is only supported on Linux.\nThe pg_shmem_allocations View # pg_shmem_allocations is a view introduced in PG13 that allows viewing the allocation of major shared memory segments, including those from PostgreSQL itself and extensions.\n\u0026gt; select sum(allocated_size)/1024/1024/1024 gb from pg_shmem_allocations; gb -------------------- 2.7658920288085938 \u0026gt;select * from pg_shmem_allocations order by 4 desc; name | off | size | allocated_size -------------------------------------+------------+------------+---------------- Buffer Blocks | 38575360 | 2415919104 | 2415919104 [null] | 2729553280 | 240300672 | 240300672 \u0026lt;anonymous\u0026gt; | [null] | 240198528 | 240198528 Buffer Descriptors | 19700992 | 18874368 | 18874368 XLOG Ctl | 171008 | 16803472 | 16803584 Backend Activity Buffer | 2707733248 | 10680320 | 10680320 ... NULL indicates unused memory, anonymous indicates anonymous page allocations. Most of the memory modules in the pg_shmem_allocations view are difficult to understand. 
You can find them by searching the source code, but there is no intuitive explanation — it simply displays the data from the source code\u0026rsquo;s init memory module.\nExample: Buffer Blocks: Searching the source code directly for \u0026ldquo;buffer blocks\u0026rdquo;:\n// Initialize shared buffer pool // Called only once, during shared memory initialization void InitBufferPool(void) { bool\tfoundBufs, foundDescs, foundIOCV, foundBufCkpt; /* Align descriptors to a cacheline boundary. */ BufferDescriptors = (BufferDescPadded *) ShmemInitStruct(\u0026#34;Buffer Descriptors\u0026#34;, NBuffers * sizeof(BufferDescPadded), \u0026amp;foundDescs); BufferBlocks = (char *) ShmemInitStruct(\u0026#34;Buffer Blocks\u0026#34;, NBuffers * (Size) BLCKSZ, \u0026amp;foundBufs); /* Align condition variables to cacheline boundary. */ BufferIOCVArray = (ConditionVariableMinimallyPadded *) ShmemInitStruct(\u0026#34;Buffer IO Condition Variables\u0026#34;, NBuffers * sizeof(ConditionVariableMinimallyPadded), \u0026amp;foundIOCV); // Checkpoint BufferIds are used to sort checkpoints in shared memory CkptBufferIds = (CkptSortItem *) ShmemInitStruct(\u0026#34;Checkpoint BufferIds\u0026#34;, NBuffers * sizeof(CkptSortItem), \u0026amp;foundBufCkpt); } The InitBufferPool() function initializes the shared buffer. The shared buffer has 4 sub-pools: Buffer Descriptors, Buffer Blocks, Buffer IO Condition Variables, Checkpoint BufferIds. Private Memory # Private memory is memory areas allocated by PostgreSQL for each session or process. Unlike shared buffers, there is not just one. Private memory of each process cannot be accessed by other processes. temp_buffers Temp buffers are used to cache temporary table data, default 8MB. temp_buffers is private memory, so temporary tables are only visible to the current session.\nwork_mem The maximum memory used by query operations, such as sorts and hash tables. Default 4MB. Each query or each plan node? 
Official documentation:\nNote that a complex query might perform several sort and hash operations at the same time, with each operation generally being allowed to use as much memory as this value specifies before it starts to write data into temporary files.\nCommunity email about /dev/shm:\nQueries can use up to work_mem for each node you see in the\nEXPLAIN plan,\nThis parameter applies to each operation (plan node) in a query, not to each query. A query can run many such operations at once, so a single query can also consume a lot of memory. Therefore, work_mem must be set very carefully to avoid exhausting OS memory. The worst case: multiple sessions, each session having multiple plan nodes, and those plan nodes using operations that heavily consume work_mem. Which operations use work_mem? For sort operations: ORDER BY, DISTINCT, merge joins. For hash table usage: hash joins, hash-based aggregation, memoize nodes, hash-based IN subqueries.\nhash_mem_multiplier Used to limit the memory size of hash-based operations. The limit is hash_mem_multiplier * work_mem. hash_mem_multiplier defaults to 2.0 (1.0 in PG13 and PG14). Although work_mem can be limited, you cannot limit how many hash operations a query uses, so PG13 added this parameter. This means that up to and including PG12, it was very difficult to limit hash table memory. In our 9.6 production environment, we found a single session consuming 300GB of memory. The culprit was the lack of hash table limits in older versions combined with an execution plan that incorrectly used hash tables.\nmaintenance_work_mem Memory area used by operations such as VACUUM, CREATE INDEX, and ALTER TABLE ADD FOREIGN KEY. These operations are initiated by a session and run in that session\u0026rsquo;s backend process, using private memory. A session can run only one of these maintenance operations at a time, and concurrency is generally low, so this parameter can be set relatively high. Autovacuum may also use this memory area and limit. 
See autovacuum_work_mem explanation.\nautovacuum_work_mem Maximum memory used by each autovacuum worker process. Default -1, meaning the maintenance_work_mem parameter is used to limit autovacuum workers. Vacuum uses at most 1GB of memory, and autovacuum has the same limit, so setting the vacuum/autovacuum memory limit above 1GB is meaningless.\nvacuum_buffer_usage_limit Limits the number of pages that VACUUM and ANALYZE can access from shared memory, to prevent too many pages from being evicted. Default is 256KB, 0 means no limit. When using VACUUM or ANALYZE commands, BUFFER_USAGE_LIMIT can be specified, which takes precedence over the GUC parameter vacuum_buffer_usage_limit.\nmax_stack_depth The maximum safe depth of the execution stack, generally meaning the stack depth of a recursive function executed on a single backend process. Default is 2MB. The OS kernel stack limit should be set slightly larger than max_stack_depth. If a recursive function exceeds the stack depth, the following error is reported:\nERROR: stack depth limit exceeded HINT: Increase the configuration parameter max_stack_depth (currently 2048kB), after ensuring the platform\u0026#39;s stack depth limit is adequate. logical_decoding_work_mem Before PG13, logical decoding would retain at most 4096 changes in memory (max_changes_in_memory hardcoded in the source). PG13 introduced the parameter logical_decoding_work_mem. If the data held by logical decoding exceeds this memory value, it is written to disk. Default 64MB.\neach replication connection only uses a single buffer of this size,\nGenerally, the number of logical replication connections is not large, so logical_decoding_work_mem can be set relatively high without issues.\nxxCache # xxCache is also private memory. For example, PostgreSQL caches relation metadata in relcache. The official documentation has relatively little description about this, but PostgreSQL memory problems are often related to it. 
For instance, the issue of catalog cache causing each backend process to consume a lot of memory without releasing it has appeared in many environments. Here is a community email from 2016 by Digoal about catalog cache consuming excessive memory\nEvery PostgreSQL session holds system data in own cache. Usually this cache is pretty small (for significant numbers of users). But can be pretty big if your catalog is untypically big and you touch almost all objects from\ncatalog in session. A implementation of this cache is simple - there is not\ndelete or limits. There is not garabage collector (and issue related to\nGC), what is great, but the long sessions on big catalog can be problem.\nThe solution is simple - close session over some time or over some number of operations. Then all memory in caches will be released.\nThe community\u0026rsquo;s explanation of catalog cache:\nEach session has its own cache for storing system data (metadata, etc.) Generally, this cache is small. When the catalog is large and a session has accessed all catalog objects, the cache can become very large. Cache management is simple: there is no deletion mechanism or limit (though invalidation messages do exist). Closing the session releases the cache. Tom Lane\u0026rsquo;s solution was also simple and blunt — add more hardware resources:\nI do not think you should complain if that takes a great deal of memory. Either rethink why you need so many tables, or buy hardware commensurate with the size of your problem.\nIn fact, there are many knowledge points about caches worth paying attention to. After understanding their principles, the solutions to cache-caused memory issues may not be limited to just one approach. There are many types of xxCache, such as relcache, syscache, plancache, etc. Since documentation is scarce, understanding xxCache requires reading the source code. The main xxCache source code is under src/backend/utils/cache. 
Source structure:\ninval.c\t-- Invalidation message dispatcher for private caches. The corresponding shared cache invalidation message handler is sinval.c relfilenodemap.c\t-- relfilenode to oid mapping cache ts_cache.c\t-- Cache for Tsearch (text search) related objects relmapper.c\t-- catalog to relfilenode mapping cache typcache.c\t-- type cache spccache.c\t-- tablespace cache evtcache.c\t-- event trigger cache attoptcache.c\t-- attribute cache plancache.c\t-- plan cache relcache.c\t-- relation cache *Focus of this article* catcache.c\t-- system catalog cache *Focus of this article* syscache.c\t-- one layer above catcache, also system catalog cache\t*Focus of this article* lsyscache.c\t-- routines for conveniently querying catalog cache, \u0026#39;l\u0026#39; likely stands for lookup partcache.c\t-- routines for operating on partition information in relcache In addition to handling various caches, there is also source code for operations and messages. Below we focus on relcache, catcache/syscache, and invalidation messages.\nrelcache # What data does a relcache entry store? Defined in src/include/utils/rel.h:\n* POSTGRES relation descriptor (a/k/a relcache entry) definitions. RelationData is the primary data structure for relcache entries:\ntypedef struct RelationData { RelFileNode rd_node;\t/* physical identifier of relation */ SMgrRelation rd_smgr;\t/* cached file handle, or NULL */ int\trd_refcnt;\t/* reference count */ BackendId\trd_backend;\t/* if temp relation, the owning backend id */ bool\trd_islocaltemp; /* is it a temp rel of the current session */ bool\trd_isnailed;\t/* is it nailed in cache */ bool\trd_isvalid;\t/* is the relcache entry valid */ bool\trd_indexvalid;\t/* are the indexes on the relation valid */ bool\trd_statvalid;\t/* are the statistics on the relation valid */ ... 
/* some subtransaction info */ SubTransactionId rd_createSubid;\t/* rel was created in current xact */ SubTransactionId rd_newRelfilenodeSubid;\t/* highest subxact changing rd_node to current value */ SubTransactionId rd_firstRelfilenodeSubid;\t/* highest subxact changing rd_node to any value */ SubTransactionId rd_droppedSubid;\t/* dropped with another Subid set */ Form_pg_class rd_rel;\t/* pointer to the relation\u0026#39;s pg_class tuple */ TupleDesc\trd_att;\t/* tuple descriptor */ Oid\trd_id;\t/* relation\u0026#39;s oid */ LockInfoData rd_lockInfo;\t/* lock info on the relation */ RuleLock *rd_rules;\t/* rewrite rules */ MemoryContext rd_rulescxt;\t/* private memory cxt for rd_rules */ TriggerDesc *trigdesc;\t/* trigger info, NULL if none */ ... /* foreign key info */ List\t*rd_fkeylist;\t/* list of ForeignKeyCacheInfo (see below) */ bool\trd_fkeyvalid;\t/* true if list has been computed */ /* partition info */ PartitionKey rd_partkey;\t/* partition key, or NULL */ MemoryContext rd_partkeycxt;\t/* private context for rd_partkey, if any */ ... List\t*rd_indexlist;\t/* list of all index OIDs */ Oid\trd_pkindex;\t/* primary key oid */ Oid\trd_replidindex; /* replica identity index oid */ List\t*rd_statlist;\t/* list of extended stats OIDs */ ... PublicationDesc *rd_pubdesc;\t/* publication descriptor, or NULL */ ... bytea\t*rd_options;\t/* parsed pg_class.reloptions */ ... Form_pg_index rd_index;\t/* index descriptor in pg_index tuple */ struct HeapTupleData *rd_indextuple;\t/* all pg_index tuples */ MemoryContext rd_indexcxt;\t/* index cxt */ ... void\t*rd_amcache;\t/* available for use by index/table AM */ ... struct FdwRoutine *rd_fdwroutine;\t/* cached function pointers, or NULL */ ... 
} RelationData; RelationData contains a large amount of relation-related metadata: oid, pg_class, partition tables, subtransactions, row security policies, statistics, index metadata, AM, etc.\nrelcache ROUTINES The ROUTINES source code is located at src/backend/utils/cache/relcache.c. There are mainly 5 stages:\nRelationCacheInitialize - Initialize relcache, initially empty RelationCacheInitializePhase2 - Initialize shared catalogs RelationCacheInitializePhase3 - Complete relcache initialization RelationIdGetRelation - Get relation descriptor by relation id RelationClose - Close a relation These 5 stages are the 5 main logical steps for a rel entry, equivalent to the lifecycle of a rel entry, not the lifecycle of relcache. The first three stages are all relcache initialization — they initialize relcache and load some system tables and their indexes. The last two stages are the logic for obtaining a reldesc and closing a relation; the relcache itself still exists.\nStage 1: RelationCacheInitialize RelationCacheInitialize initializes relcache:\n// Define initial size 400 #define INITRELCACHESIZE\t400 void RelationCacheInitialize(void) { HASHCTL\tctl; int\tallocsize; /* * make sure cache memory context exists */ // Check if cache mctx exists, create one if not if (!CacheMemoryContext) CreateCacheMemoryContext(); // Create hash table indexed by OID for relcache ctl.keysize = sizeof(Oid); ctl.entrysize = sizeof(RelIdCacheEnt); RelationIdCache = hash_create(\u0026#34;Relcache by OID\u0026#34;, INITRELCACHESIZE, \u0026amp;ctl, HASH_ELEM | HASH_BLOBS); ... 
// Initialize relation mapper RelationMapInitialize(); } RelationCacheInitialize does not allocate any relation operations; it only initializes relcache memory, hash tables, mappers, etc.\nStage 2: RelationCacheInitializePhase2\nvoid RelationCacheInitializePhase2(void) { MemoryContext oldcxt; // Initialize relation mapper RelationMapInitializePhase2(); // If in bootstrap mode, shared catalogs don\u0026#39;t exist yet, so do nothing if (IsBootstrapProcessingMode()) return; // Switch to current cache mctx oldcxt = MemoryContextSwitchTo(CacheMemoryContext); // Try to load shared relcache file if (!load_relcache_init_file(true)) // If init file not loaded { formrdesc(\u0026#34;pg_database\u0026#34;, DatabaseRelation_Rowtype_Id, true, Natts_pg_database, Desc_pg_database); formrdesc(\u0026#34;pg_authid\u0026#34;, AuthIdRelation_Rowtype_Id, true, Natts_pg_authid, Desc_pg_authid); formrdesc(\u0026#34;pg_auth_members\u0026#34;, AuthMemRelation_Rowtype_Id, true, Natts_pg_auth_members, Desc_pg_auth_members); formrdesc(\u0026#34;pg_shseclabel\u0026#34;, SharedSecLabelRelation_Rowtype_Id, true, Natts_pg_shseclabel, Desc_pg_shseclabel); formrdesc(\u0026#34;pg_subscription\u0026#34;, SubscriptionRelation_Rowtype_Id, true, Natts_pg_subscription, Desc_pg_subscription); #define NUM_CRITICAL_SHARED_RELS\t5\t/* fix if you change list above */ } MemoryContextSwitchTo(oldcxt); } The init file is divided into shared and local cache init files. load_relcache_init_file() attempts to load data from these two types of files into relcache (here it should only load the shared ones). 
If loading fails, it creates descriptors for the 5 basic system tables: pg_database, pg_authid, etc.\nStage 3: RelationCacheInitializePhase3 is the third stage of initialization and contains the most content:\nvoid RelationCacheInitializePhase3(void) { HASH_SEQ_STATUS status; RelIdCacheEnt *idhentry; MemoryContext oldcxt; bool\tneedNewCacheFile = !criticalSharedRelcachesBuilt; RelationMapInitializePhase3(); // Switch to CacheMemoryContext oldcxt = MemoryContextSwitchTo(CacheMemoryContext); // Like stage 2, load more system table descriptors if (IsBootstrapProcessingMode() || !load_relcache_init_file(false)) { needNewCacheFile = true; formrdesc(\u0026#34;pg_class\u0026#34;, RelationRelation_Rowtype_Id, false, Natts_pg_class, Desc_pg_class); formrdesc(\u0026#34;pg_attribute\u0026#34;, AttributeRelation_Rowtype_Id, false, Natts_pg_attribute, Desc_pg_attribute); formrdesc(\u0026#34;pg_proc\u0026#34;, ProcedureRelation_Rowtype_Id, false, Natts_pg_proc, Desc_pg_proc); formrdesc(\u0026#34;pg_type\u0026#34;, TypeRelation_Rowtype_Id, false, Natts_pg_type, Desc_pg_type); #define NUM_CRITICAL_LOCAL_RELS 4\t/* fix if you change list above */ } MemoryContextSwitchTo(oldcxt); ... // If we haven\u0026#39;t obtained critical system indexes yet, do it now // Because catcache and/or opclass cache depend on critical system indexes in relcache if (!criticalRelcachesBuilt) // If critical indexes not loaded { load_critical_index(ClassOidIndexId, RelationRelationId); ... load_critical_index(TriggerRelidNameIndexId, TriggerRelationId); #define NUM_CRITICAL_LOCAL_INDEXES\t7\t/* fix if you change list above */ criticalRelcachesBuilt = true; // Mark: critical system table indexes obtained } // Continue processing shared critical system table indexes. // These shared critical system tables are needed in certain situations (autovacuum, client authentication, etc.) if (!criticalSharedRelcachesBuilt) { load_critical_index(DatabaseNameIndexId, DatabaseRelationId); ... 
load_critical_index(SharedSecLabelObjectIndexId, SharedSecLabelRelationId); #define NUM_CRITICAL_SHARED_INDEXES 6\t/* fix if you change list above */ criticalSharedRelcachesBuilt = true; // Mark: shared critical system table indexes obtained } // Scan all entries in relcache and update those that are erroneous // from formrdesc or init file // If erroneous, read pg_class data and replace the erroneous entry // Because the cache file does not contain rules, triggers, security policies, // also fetch from pg_class ... while ((idhentry = (RelIdCacheEnt *) hash_seq_search(\u0026amp;status)) != NULL) { Relation\trelation = idhentry-\u0026gt;reldesc; bool\trestart = false; // Ensure relations in use are not flushed RelationIncrementReferenceCount(relation); // If it\u0026#39;s an erroneous entry, read the tuple from pg_class if (relation-\u0026gt;rd_rel-\u0026gt;relowner == InvalidOid) { ... memcpy((char *) relation-\u0026gt;rd_rel, (char *) relp, CLASS_TUPLE_SIZE); // Update rd_option if (relation-\u0026gt;rd_options) pfree(relation-\u0026gt;rd_options); RelationParseRelOptions(relation, htup); ... ReleaseSysCache(htup); ... 
restart = true; } // Fix data not in the init file // For example, relhasrules, relhastriggers may be outdated or incorrect if (relation-\u0026gt;rd_rel-\u0026gt;relhasrules \u0026amp;\u0026amp; relation-\u0026gt;rd_rules == NULL) { RelationBuildRuleLock(relation); if (relation-\u0026gt;rd_rules == NULL) relation-\u0026gt;rd_rel-\u0026gt;relhasrules = false; restart = true; } if (relation-\u0026gt;rd_rel-\u0026gt;relhastriggers \u0026amp;\u0026amp; relation-\u0026gt;trigdesc == NULL) { RelationBuildTriggers(relation); if (relation-\u0026gt;trigdesc == NULL) relation-\u0026gt;rd_rel-\u0026gt;relhastriggers = false; restart = true; } // Reload row security policies, since init file doesn\u0026#39;t contain them if (relation-\u0026gt;rd_rel-\u0026gt;relrowsecurity \u0026amp;\u0026amp; relation-\u0026gt;rd_rsdesc == NULL) { RelationBuildRowSecurity(relation); Assert(relation-\u0026gt;rd_rsdesc != NULL); restart = true; } // If tableam needs reloading if (relation-\u0026gt;rd_tableam == NULL \u0026amp;\u0026amp; (RELKIND_HAS_TABLE_AM(relation-\u0026gt;rd_rel-\u0026gt;relkind) || relation-\u0026gt;rd_rel-\u0026gt;relkind == RELKIND_SEQUENCE)) { RelationInitTableAccessMethod(relation); Assert(relation-\u0026gt;rd_tableam != NULL); restart = true; } // Decrement reference count RelationDecrementReferenceCount(relation); ... // Finally, if needed, update the init file (since there may have been reloads, don\u0026#39;t waste them) if (needNewCacheFile) { InitCatalogCachePhase2(); /* now write the files */ write_relcache_init_file(true); // Write global init file write_relcache_init_file(false); // Write private init file } } Compared to Stage 2 which loads 5 system tables, RelationCacheInitializePhase3() loads more system tables, such as pg_class, pg_proc, and the indexes on these tables. Of course, the precondition for loading these rels is that they are not in cache or have expired. 
After reloading is complete, the \u0026ldquo;new\u0026rdquo; catalog is written to the init file. Looking at the write_relcache_init_file function source code when writing the init file, we can understand the meaning of the true and false parameters:\nstatic void write_relcache_init_file(bool shared) { ... if (shared) { snprintf(tempfilename, sizeof(tempfilename), \u0026#34;global/%s.%d\u0026#34;, RELCACHE_INIT_FILENAME, MyProcPid); snprintf(finalfilename, sizeof(finalfilename), \u0026#34;global/%s\u0026#34;, RELCACHE_INIT_FILENAME); } else { snprintf(tempfilename, sizeof(tempfilename), \u0026#34;%s/%s.%d\u0026#34;, DatabasePath, RELCACHE_INIT_FILENAME, MyProcPid); snprintf(finalfilename, sizeof(finalfilename), \u0026#34;%s/%s\u0026#34;, DatabasePath, RELCACHE_INIT_FILENAME); } ... } true means write to the global init file. false means write to the local init file.\nThe RELCACHE_INIT_FILENAME parameter macro definition:\n#define RELCACHE_INIT_FILENAME \u0026#34;pg_internal.init\u0026#34; So the written init files are:\nshared: global/pg_internal.init local: DatabasePath/pg_internal.init and DatabasePath/pg_internal.init.myPID Let\u0026rsquo;s look at real init file paths:\n[postgres]$ find ./ -name *init* ./global/pg_internal.init #shared ./base/1/pg_internal.init #local ./base/13577/pg_internal.init #local ./base/13578/pg_internal.init\t#local ./base/16398/pg_internal.init\t#local ./base/16811/pg_internal.init\t#local ./base/17674/pg_internal.init\t#local Diagram of the three initialization stages call flow: (https://blog.japinli.top/2022/07/postgres-relcache-and-syscache/)\nStage 4: RelationIdGetRelation Find a reldesc by OID. 
The caller only needs an AccessShareLock on the OID and is responsible for incrementing/decrementing the rel\u0026rsquo;s reference count.\nRelation RelationIdGetRelation(Oid relationId) { Relation\trd; // Ensure we\u0026#39;re in a transaction Assert(IsTransactionState()); // First try to find in cache via reldesc RelationIdCacheLookup(relationId, rd); if (RelationIsValid(rd)) { // Return NULL for dropped relations if (rd-\u0026gt;rd_droppedSubid != InvalidSubTransactionId) { Assert(!rd-\u0026gt;rd_isvalid); return NULL; } RelationIncrementReferenceCount(rd); if (!rd-\u0026gt;rd_isvalid) // If cached rel is invalid, revalidate it { if (rd-\u0026gt;rd_rel-\u0026gt;relkind == RELKIND_INDEX || rd-\u0026gt;rd_rel-\u0026gt;relkind == RELKIND_PARTITIONED_INDEX) // Load index info directly RelationReloadIndexInfo(rd); else // For non-index, clear the reldesc RelationClearRelation(rd, true); ... } return rd; } // No reldesc found, create a new one rd = RelationBuildDesc(relationId, true); if (RelationIsValid(rd)) RelationIncrementReferenceCount(rd); return rd; } RelationIdGetRelation is relatively simple: it obtains a reldesc and index info via OID.\nStage 5: RelationClose The code for RelationClose is also quite simple:\nvoid RelationClose(Relation relation) { // No lock operations needed, simply decrement refcount RelationDecrementReferenceCount(relation); // If no sessions have the relation open, partition descriptors can be deleted if (RelationHasReferenceCountZero(relation)) { if (relation-\u0026gt;rd_pdcxt != NULL \u0026amp;\u0026amp; relation-\u0026gt;rd_pdcxt-\u0026gt;firstchild != NULL) MemoryContextDeleteChildren(relation-\u0026gt;rd_pdcxt); if (relation-\u0026gt;rd_pddcxt != NULL \u0026amp;\u0026amp; relation-\u0026gt;rd_pddcxt-\u0026gt;firstchild != NULL) MemoryContextDeleteChildren(relation-\u0026gt;rd_pddcxt); } #ifdef RELCACHE_FORCE_RELEASE if (RelationHasReferenceCountZero(relation) \u0026amp;\u0026amp; relation-\u0026gt;rd_createSubid == 
InvalidSubTransactionId \u0026amp;\u0026amp; relation-\u0026gt;rd_firstRelfilenodeSubid == InvalidSubTransactionId) RelationClearRelation(relation, false); #endif } RelationClose is the operation for closing access to a relation. Generally, this function only decrements the refcount of sessions accessing the relation. However, there are exceptions:\nWhen refcount is 0, MemoryContextDeleteChildren() is executed. This function deletes the mctx related to child partition descriptors, which does release memory. When refcount is 0 and the macro RELCACHE_FORCE_RELEASE is defined, the RelationClearRelation() function deletes the hash table entry. This step does not release memory. The RELCACHE_FORCE_RELEASE macro was not found (only available with explicit compilation?). relcache is not completely without memory release logic, but the trigger conditions are relatively strict, and the freed memory is not all of the relcache memory.\nsyscache/catcache # CatCache caches tuples from system tables. Built on top of CatCache is another layer called SysCache (KV interface). Essentially, CatCache and SysCache together reorganize data from system tables in memory using a KV approach. syscache/catcache is more complex. Here I\u0026rsquo;ll briefly extract some easily interpretable content, mainly to understand the cached content and loading mechanism of syscache. For deeper source code analysis, refer to PostgreSQL Source Analysis — Storage Management — Memory Management (3) and PostgreSQL RelCache and SysCache Caches.\ncatcache struct\ntypedef struct catcache { int\tid;\t// cache id, defined in syscache.h int\tcc_nbuckets;\t// number of hash buckets for this cache TupleDesc\tcc_tupdesc;\t// tuple descriptor, copied from reldesc ... const char *cc_relname;\t// system table name corresponding to the tuple Oid\tcc_reloid;\t// system table OID Oid\tcc_indexoid;\t// index OID for cache key bool\tcc_relisshared; // is the table shared across databases? ... 
// Statistics used by catcache #ifdef CATCACHE_STATS long\tcc_searches;\t// number of queries against this catcache long\tcc_hits;\t// hit count long\tcc_neg_hits;\t// negative entry hit count ... #endif } CatCache; catcache entry\ntypedef struct catctup { int\tct_magic;\t// identifies this catctup entry #define CT_MAGIC 0x57261502 uint32\thash_value;\t// hash key value for this tuple ... // Dead tuples won\u0026#39;t be returned, but will be removed from catcache when refcount reaches zero int\trefcount;\t// tuple refcount, indicates whether it\u0026#39;s being accessed bool\tdead;\t// dead tuple, but not yet cleaned up bool\tnegative;\t// is this a negative cache entry? HeapTupleData tuple;\t// tuple header structure ... CatCache *my_cache;\t// link to the catcache this tuple belongs to } CatCTup; SearchCatCacheMiss() Function SearchCatCacheMiss() is the main function for catcache hit/miss, and after a miss it accesses tuples from the dictionary.\nstatic pg_noinline HeapTuple SearchCatCacheMiss(CatCache *cache, int nkeys, uint32 hashValue, Index hashIndex, Datum v1, Datum v2, Datum v3, Datum v4) { ScanKeyData cur_skey[CATCACHE_MAXKEYS]; Relation\trelation; SysScanDesc scandesc; HeapTuple\tntp; CatCTup *ct; Datum\targuments[CATCACHE_MAXKEYS]; ... 
// Tuple not found in cache, so try to find it directly from the table // If found, add it to cache // If not found, add a negative cache entry relation = table_open(cache-\u0026gt;cc_reloid, AccessShareLock); scandesc = systable_beginscan(relation, cache-\u0026gt;cc_indexoid, IndexScanOK(cache, cur_skey), NULL, nkeys, cur_skey); ct = NULL; // When tuple is valid, create an entry while (HeapTupleIsValid(ntp = systable_getnext(scandesc))) { ct = CatalogCacheCreateEntry(cache, ntp, arguments, hashValue, hashIndex, false); // Create an entry // Immediately increment refcount ResourceOwnerEnlargeCatCacheRefs(CurrentResourceOwner); ct-\u0026gt;refcount++; ResourceOwnerRememberCatCacheRef(CurrentResourceOwner, \u0026amp;ct-\u0026gt;tuple); break;\t/* assume only one match */ } systable_endscan(scandesc); table_close(relation, AccessShareLock); /* // If no tuple found, create a negative cache entry (a dummy tuple) // The dummy tuple has key columns, all others are null // During startup, the invalidation mechanism is not active and entries // cannot be cleaned up if a tuple is actually created later // So during this phase, negative entries are not created */ if (ct == NULL) // If no tuple found, enter the following logic { if (IsBootstrapProcessingMode()) // Return NULL directly if in startup phase return NULL; ct = CatalogCacheCreateEntry(cache, NULL, arguments, hashValue, hashIndex, true); // Create entry CACHE_elog(DEBUG2, \u0026#34;SearchCatCache(%s): Contains %d/%d tuples\u0026#34;, cache-\u0026gt;cc_relname, cache-\u0026gt;cc_ntup, CacheHdr-\u0026gt;ch_ntup); CACHE_elog(DEBUG2, \u0026#34;SearchCatCache(%s): put neg entry in bucket %d\u0026#34;, cache-\u0026gt;cc_relname, hashIndex); // Negative entries are not returned to caller, refcount remains 0 return NULL; } ... 
return \u0026amp;ct-\u0026gt;tuple; } The dummy tuple (negative cache entry) here is a clever touch: caching a non-existent tuple in catcache means the next lookup of the same key does not have to query the data dictionary again, which avoids repeated pointless dictionary lookups.\nCache Invalidation Messages # When a tuple is updated or deleted, the old version becomes invisible once the transaction ends under the transaction visibility rules. Caches must be told about this so that the cached tuples are invalidated and reloaded on the next read. Similarly, when new tuples are inserted, negative cache entries may need to be flushed so that the new tuples can be found. One common scenario is DDL: DDL may make certain metadata tuples invalid, at which point cache invalidation messages need to be sent to the various private caches to clean up their entries. This invalidation mechanism applies to private caches such as syscache and relcache. Since idle backends do not read sinval events on their own, lagging backends must be actively signaled so that they can \u0026ldquo;catch up.\u0026rdquo; When a transaction completes, its invalidation events must be broadcast to other backends via the SI message queue.\nThe source code is split into two parts: sinval and inval.\nInvalidation interface: src/include/utils/inval.h Invalidation dispatch: src/backend/utils/cache/inval.c Invalidation message sharing interface: src/include/storage/sinval.h Invalidation message sharing dispatch: src/backend/storage/ipc/sinval.c Invalidation message sharing data structures interface: src/include/storage/sinvaladt.h Invalidation message sharing data structures: src/backend/storage/ipc/sinvaladt.c In src/include/storage/sinval.h, the shared-invalidation message structure is defined:\ntypedef union { int8\tid;\t/* type field --- must be first */ SharedInvalCatcacheMsg cc; SharedInvalCatalogMsg cat; SharedInvalRelcacheMsg rc; SharedInvalSmgrMsg sm; SharedInvalRelmapMsg rm; 
SharedInvalSnapshotMsg sn; } SharedInvalidationMessage; Shared-invalidation messages include the following types:\nInvalidate a specific catcache entry Invalidate all catcache entries for a particular system catalog Invalidate a specific relcache entry Invalidate ALL relcache entries Invalidate the smgr cache entry for a particular physical relation Invalidate a mapped-relation mapping Invalidate saved snapshots that scanned a relation Messages stay in the shared memory queue until all other processes have read them. Normally, receiving processes only read messages at specific times, so if a receiving process is idle (not processing any user requests) or too busy with other work to read them, the messages may remain in shared memory indefinitely. In the worst case, if the queue has no room left for a process to store a new message, that process has to take on the cleanup task itself. (In practice, this cleanup is done proactively, so space rarely runs out.) Old messages can only be discarded once all other processes have read them. If some processes have not done so for the reasons above, the cleaning process must explicitly signal the lagging processes to catch up. Once they have caught up, the messages can be freely discarded. When a backend processes a message, it first checks whether the catalog tuple specified in the message is currently in its cache (the message also specifies the syscache identifier). If so, the tuple is removed from the cache\u0026rsquo;s hash table. The next time that tuple is requested, it is re-read from the underlying catalog table and added to the hash table, so subsequent accesses see the new value. 
If a process already holds a lock on a database object that prevents concurrent processes from modifying it, it can keep using the cached tuple until the lock is released.\nxxCache Issues Summary # There are many types of xxCache; the most notable are plancache, relcache, and syscache. These caches live in private memory, so every backend process has its own copy. They have no LRU mechanism to evict stale data; they rely on invalidation messages to clean up snapshots and metadata that are no longer needed globally, such as when an object is dropped.\nrelcache is the place most likely to occupy significant memory. relcache loads metadata, and during initialization it reads *.init files to speed up loading. Later, further metadata is loaded on demand as it is needed. catcache caches tuples from the data dictionary. syscache is one layer above catcache; together they implement the data dictionary cache. If a tuple does not exist, a negative entry is created to avoid accessing the data dictionary again on the next lookup. A catcache miss likewise reads the tuple from the data dictionary. Cache invalidation messages exist to inform caches that cached tuples and snapshot information have become stale. They can invalidate the corresponding relcache and catcache entries. Entries are removed from the cache\u0026rsquo;s hash table, which releases memory. Since the cache memory release mechanisms are very limited, when there is a lot of metadata (many tables, partitioned tables), relcache and catcache can consume a lot of memory, and this can happen in every backend. Possible solutions:\nGlobal cache. Like Oracle\u0026rsquo;s dictionary cache, cache in one place with shared access. For example, PolarDB\u0026rsquo;s Global RelCache has already implemented this functionality. LRU. 
An LRU mechanism suited to these caches would separate hot entries from cold ones and remove excessively old entries from the hash table. This would probably require parameters that cap cache size, ideally one per cache\u0026hellip; Threading mode. Memory is shared and accessible to all threads, which is a natural advantage. Periodically disconnect long connections. All of the above are just wishful thinking. Don\u0026rsquo;t create too many tables or partitions (note that in PostgreSQL, partitions are also tables). Memory Contexts # PostgreSQL manages memory through the memory context mechanism. I previously did a translation about memory contexts, roughly summarized as follows:\nC requires explicit memory deallocation. To reduce the risk of memory leaks, PostgreSQL implemented memory contexts to manage private memory. Memory contexts do not require freeing memory after each use; instead, memory is released by deleting a particular context. Memory contexts form a hierarchy: deleting a parent context recursively deletes all of its child contexts. Outside of a debugger, observing memory context usage is quite difficult. Starting from PG14, the pg_backend_memory_contexts view shows the memory context usage of the current session. Timing of memory context creation during SQL operations: (https://www.pgcon.org/2019/schedule/attachments/514_introduction-memory-contexts.pdf)\nSource Code Analysis # In PostgreSQL, all memory allocation, deallocation, and resetting is done within memory contexts, so the C library functions malloc(), realloc(), and free() are not called directly. Instead, palloc(), repalloc(), and pfree() are used for memory allocation, reallocation, and deallocation.\nC Library Memory Functions C library dynamic memory allocation functions include:\nmalloc(): The C library\u0026rsquo;s malloc() function (memory allocation) allocates a single block of memory of the requested size; its contents are uninitialized. 
calloc(): The C library\u0026rsquo;s calloc() function (contiguous allocation) allocates memory for an array of elements and initializes it to zero. free(): Used to release memory. Memory obtained from malloc() and calloc() is not released automatically; free() must be called to release it. realloc(): Used to resize a previously allocated block. There is also a C library function memset(), used to fill a memory block with a specific value.\nPostgreSQL-Defined Memory Functions The functions actually used heavily in PostgreSQL source code for memory allocation, deallocation, etc., are palloc(), palloc0(), repalloc(), and pfree(). Most calls do not go to the C library allocator; only in certain cases do they call C library memory functions. This effectively adds a layer on top of the C library allocator, with PostgreSQL handling small allocations on its own.\npalloc(): palloc() duplicates the logic of MemoryContextAlloc for CurrentMemoryContext: it calls the alloc function pointer in the methods field of the current memory context, which for an AllocSet context is the AllocSetAlloc function.\nvoid * palloc(Size size) { /* duplicates MemoryContextAlloc to avoid increased overhead */ void\t*ret; MemoryContext context = CurrentMemoryContext; ... ret = context-\u0026gt;methods-\u0026gt;alloc(context, size); ... return ret; } palloc0():\nvoid * palloc0(Size size) { ... ret = context-\u0026gt;methods-\u0026gt;alloc(context, size); ... MemSetAligned(ret, 0, size); return ret; } MemSetAligned is a macro that ends up calling the C library memset to fill memory; here palloc0 passes 0 as the fill value.\n#define MemSetAligned(start, val, len)\\ ...\\ memset(_start, _val, _len); \\ ...\tCompared to palloc, palloc0 not only calls alloc(context, size) but also zeroes out the memory content.\nrepalloc(): repalloc() primarily calls the realloc method of MemoryContext. 
The realloc function pointer corresponds to the AllocSetRealloc function.\n/* * repalloc *\tAdjust the size of a previously allocated chunk. */ void * repalloc(void *pointer, Size size) { MemoryContext context = GetMemoryChunkContext(pointer); ... ret = context-\u0026gt;methods-\u0026gt;realloc(context, pointer, size); ... return ret; } pfree(): pfree calls the free_p function pointer in the methods field of the memory context to which the memory chunk belongs, to release the memory chunk\u0026rsquo;s space. Currently, in PostgreSQL, the free_p pointer actually points to the AllocSetFree function.\n/* * pfree *\tRelease an allocated chunk. */ void pfree(void *pointer) { MemoryContext context = GetMemoryChunkContext(pointer); context-\u0026gt;methods-\u0026gt;free_p(context, pointer); VALGRIND_MEMPOOL_FREE(context, pointer); } AllocSetAlloc Memory Allocation Looking at the alloc method within, alloc ultimately points to the AllocSetAlloc function. AllocSetAlloc looks rather complex, but it becomes easier to understand when read in segments:\nstatic void * AllocSetAlloc(MemoryContext context, Size size) { AllocSet\tset = (AllocSet) context; AllocBlock\tblock; AllocChunk\tchunk; int\tfidx; Size\tchunk_size; Size\tblksize; ... // If requested memory exceeds the max chunk size, allocate an entire memory block if (size \u0026gt; set-\u0026gt;allocChunkLimit) { ... block = (AllocBlock) malloc(blksize); ... } // If requested memory is less than chunk size, check free list for available free chunks fidx = AllocSetFreeIndex(size); chunk = set-\u0026gt;freelist[fidx]; if (chunk != NULL) // There are chunks available in the free list { Assert(chunk-\u0026gt;size \u0026gt;= size); set-\u0026gt;freelist[fidx] = (AllocChunk) chunk-\u0026gt;aset; chunk-\u0026gt;aset = (void *) set; ... return AllocChunkGetPointer(chunk); } ... 
// If there\u0026#39;s space, try to place the chunk in the allocation block; if not, create a new block if ((block = set-\u0026gt;blocks) != NULL) { Size\tavailspace = block-\u0026gt;endptr - block-\u0026gt;freeptr; if (availspace \u0026lt; (chunk_size + ALLOC_CHUNKHDRSZ)) { ... block = NULL; } } // No space, create a new block if (block == NULL) { Size\trequired_size; ... // Requested block size is a power of 2, not exceeding maxBlockSize required_size = chunk_size + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ; while (blksize \u0026lt; required_size) blksize \u0026lt;\u0026lt;= 1; // Use malloc to allocate the block, size is a power of 2 block = (AllocBlock) malloc(blksize); ... } (https://smartkeyerror.com/PostgreSQL-MemoryContext)\npalloc() =\u0026gt; AllocSetAlloc() only calls malloc() to request memory from the OS in two cases: when the requested size exceeds the chunk size limit, or when the freelist has no suitable free chunk and the current block does not have enough free space. In all other cases, it reuses a free chunk from the freelist or carves a new chunk out of the current block.\npfree() is similar (not demonstrated here): pfree() =\u0026gt; AllocSetFree() releases a specified memory chunk in a memory context. If the chunk is an oversized one that occupies a memory block of its own (a request that exceeded the chunk size limit), free() is called directly to release that memory block. Otherwise, the chunk is added to the freelist for reuse by a later allocation.\nViewing Memory Context Size # PG14+: pg_backend_memory_contexts view to directly inspect memory context memory within the database. 
lzldb=\u0026gt; SELECT * FROM pg_backend_memory_contexts ORDER BY used_bytes DESC LIMIT 5; name | ident | parent | level | total_bytes | total_nblocks | free_bytes | free_chunks | used_bytes -------------------------+-------+------------------+-------+-------------+---------------+------------+-------------+------------ CacheMemoryContext | | TopMemoryContext | 1 | 1048576 | 8 | 508216 | 1 | 540360 Timezones | | TopMemoryContext | 1 | 104120 | 2 | 2616 | 0 | 101504 TopMemoryContext | | | 0 | 97680 | 5 | 12904 | 7 | 84776 ExecutorState | | PortalContext | 3 | 49208 | 4 | 4424 | 3 | 44784 WAL record construction | | TopMemoryContext | 1 | 49768 | 2 | 6360 | 0 | 43408 PG14+: pg_log_backend_memory_contexts function outputs memory information to the log file, producing output similar to MemoryContextStats(TopMemoryContext) log output. SELECT pg_log_backend_memory_contexts(9293); Universal — gdb MemoryContextStats(TopMemoryContext) Use gdb to call MemoryContextStats(TopMemoryContext):\ngdb (gdb) attach 9293 (gdb) p MemoryContextStats(TopMemoryContext) $2 = void Log output:\nTopMemoryContext: 97680 total in 5 blocks; 16856 free (16 chunks); 80824 used TableSpace cache: 8192 total in 1 blocks; 2088 free (0 chunks); 6104 used RowDescriptionContext: 8192 total in 1 blocks; 6888 free (0 chunks); 1304 used MessageContext: 8192 total in 1 blocks; 6888 free (1 chunks); 1304 used Operator class cache: 8192 total in 1 blocks; 552 free (0 chunks); 7640 used ... Relcache by OID: 16384 total in 2 blocks; 3504 free (2 chunks); 12880 used CacheMemoryContext: 524288 total in 7 blocks; 90840 free (0 chunks); 433448 used index info: 2048 total in 2 blocks; 904 free (0 chunks); 1144 used: pg_statistic_ext_relid_index ... 
index info: 2048 total in 2 blocks; 824 free (0 chunks); 1224 used: pg_database_oid_index index info: 2048 total in 2 blocks; 824 free (0 chunks); 1224 used: pg_authid_rolname_index WAL record construction: 49768 total in 2 blocks; 6360 free (0 chunks); 43408 used PrivateRefCount: 8192 total in 1 blocks; 2616 free (0 chunks); 5576 used MdSmgr: 8192 total in 1 blocks; 7592 free (0 chunks); 600 used LOCALLOCK hash: 8192 total in 1 blocks; 552 free (0 chunks); 7640 used Timezones: 104120 total in 2 blocks; 2616 free (0 chunks); 101504 used ErrorContext: 8192 total in 1 blocks; 7928 free (3 chunks); 264 used Summary # references # src/backend/utils/mmgr/mcxt.c\nsrc/backend/utils/mmgr/README\nhttps://momjian.us/main/writings/pgsql/inside_shmem.pdf\nhttps://www.interdb.jp/pg/pgsql02.html\nhttps://www.postgresql.org/docs/current/runtime-config-resource.htm\nhttps://www.postgresql.org/docs/16/kernel-resources.html\nhttps://blog.csdn.net/weixin_45644897/article/details/121340327\nhttps://help.aliyun.com/zh/polardb/polardb-for-postgresql/global-cache\nhttps://www.cnblogs.com/feishujun/p/PostgreSQLSourceAnalysis_cache02.html\nhttps://blog.japinli.top/2022/07/postgres-relcache-and-syscache/\nhttps://amitlan.com/2019/06/14/caches-inval.html\nhttps://www.cybertec-postgresql.com/en/memory-context-for-postgresql-memory-management/\nhttps://www.geeksforgeeks.org/dynamic-memory-allocation-in-c-using-malloc-calloc-free-and-realloc/\nhttps://www.cnblogs.com/feishujun/p/PostgreSQLSourceAnalysis_mmgr01.html\nhttps://www.cnblogs.com/feishujun/p/PostgreSQLSourceAnalysis_mmgr02.html\nhttps://smartkeyerror.com/PostgreSQL-MemoryContext\nhttps://jnidzwetzki.github.io/2022/05/28/postgres-memory-context.html\nhttps://www.pgcon.org/2019/schedule/attachments/514_introduction-memory-contexts.pdf\n","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/2024/08/12/a-brief-analysis-of-postgresql-memory/","section":"Posts","summary":"Architecture # 
(https://www.postgresql.fastware.com/blog/lets-get-back-to-basics-postgresql-memory-components)\n(http://geekdaxue.co/read/fcant@sql/qts5is)\nShared Memory # Linux Shared Memory Implementation # (https://momjian.us/main/writings/pgsql/inside_shmem.pdf)\nShared Memory on Linux Shared memory is an IPC (Inter-Process Communication) mechanism supported by Unix-based operating systems (including Linux). It is a type of memory that multiple processes can simultaneously use to communicate with each other. Shared memory is one of the fastest IPC mechanisms because it does not require processes to copy data between each other. Processes can access shared memory through their own address space.\n","title":"A Brief Analysis of PostgreSQL Memory","type":"posts"},{"content":" Command Options # TRUNCATE [ TABLE ] [ ONLY ] name [ * ] [, ... ] [ RESTART IDENTITY | CONTINUE IDENTITY ] [ CASCADE | RESTRICT ] 1. ONLY: truncate only the specified table. When a table has inheritance children or child partitions, by default they are truncated together; ONLY can truncate just the inheritance parent table. Partitioned parent tables cannot specify ONLY.\n-- Cannot truncate only a partitioned parent table =\u0026gt; truncate only parttable; ERROR: 42809: cannot truncate only a partitioned table HINT: Do not specify the ONLY keyword, or use TRUNCATE ONLY on the partitions directly. LOCATION: ExecuteTruncate, tablecmds.c:1655 -- truncate only the inheritance parent table, only the parent is cleaned =\u0026gt; truncate table only parenttable; TRUNCATE TABLE =\u0026gt; select tableoid::regclass,count(*) from parenttable group by tableoid::regclass ; tableoid | count ------------+------- childtable | 1 -- Directly truncate the inheritance parent table, child tables are also cleaned =\u0026gt; truncate table parenttable; TRUNCATE TABLE =\u0026gt; select tableoid::regclass,count(*) from parenttable group by tableoid::regclass ; tableoid | count ----------+------- (0 rows) 2. 
RESTART IDENTITY CONTINUE IDENTITY: whether to reset sequences on columns. Default is CONTINUE.\n-- bigserial creates a column sequence by default =\u0026gt; create table tableserial (a bigserial not null,b name); CREATE TABLE =\u0026gt; \\d+ tableserial; Table \u0026#34;public.tableserial\u0026#34; Column | Type | Collation | Nullable | Default | Storage | Stats target | Description --------+--------+-----------+----------+----------------------------------------+---------+--------------+------------- a | bigint | | not null | nextval(\u0026#39;tableserial_a_seq\u0026#39;::regclass) | plain | | b | name | | | | plain | | =\u0026gt; insert into tableserial(b) select md5(random()::text) from generate_series(1,1000); INSERT 0 1000 -- seq current value is 1000 =\u0026gt; select currval(\u0026#39;tableserial_a_seq\u0026#39;::regclass); currval --------- 1000 -- Direct truncate does not reset sequences by default =\u0026gt; truncate table tableserial; TRUNCATE TABLE =\u0026gt; select currval(\u0026#39;tableserial_a_seq\u0026#39;::regclass) cur,nextval(\u0026#39;tableserial_a_seq\u0026#39;::regclass); cur | nextval ------+--------- 1000 | 1001 -- Explicitly specify RESTART IDENTITY to reset sequences =\u0026gt; truncate table tableserial RESTART IDENTITY; TRUNCATE TABLE -- Note: seq is reset on nextval =\u0026gt; select currval(\u0026#39;tableserial_a_seq\u0026#39;::regclass) cur,nextval(\u0026#39;tableserial_a_seq\u0026#39;::regclass); cur | nextval ------+--------- 1001 | 1 3. 
CASCADE: truncate the table and all foreign key referencing tables.\n-- Create primary table, foreign key table, and data =\u0026gt; create table pri_tab(id bigint primary key,name varchar(10)); CREATE TABLE =\u0026gt; insert into pri_tab values (1,\u0026#39;abc\u0026#39;),(2,\u0026#39;abc\u0026#39;),(3,\u0026#39;abc\u0026#39;); INSERT 0 3 =\u0026gt; create table frn_tab(id bigint,FOREIGN KEY (id) REFERENCES pri_tab(id)); CREATE TABLE =\u0026gt; insert into frn_tab values (1),(2); INSERT 0 2 =\u0026gt; select * from pri_tab; id | name ----+------ 1 | abc 2 | abc 3 | abc (3 rows) -- Foreign key table frn_tab depends on pri_tab\u0026#39;s data =\u0026gt; select * from frn_tab; id ---- 1 2 (2 rows) -- With foreign key references on the primary table, CASCADE is required on the foreign key table, otherwise truncate fails =\u0026gt; truncate table pri_tab ; ERROR: 0A000: cannot truncate a table referenced in a foreign key constraint DETAIL: Table \u0026#34;frn_tab\u0026#34; references \u0026#34;pri_tab\u0026#34;. HINT: Truncate table \u0026#34;frn_tab\u0026#34; at the same time, or use TRUNCATE ... CASCADE. LOCATION: heap_truncate_check_FKs, heap.c:3427 -- Clear foreign key constrained tables together =\u0026gt; truncate table pri_tab cascade; NOTICE: 00000: truncate cascades to table \u0026#34;frn_tab\u0026#34; LOCATION: ExecuteTruncateGuts, tablecmds.c:1725 TRUNCATE TABLE =\u0026gt; select * from pri_tab; id | name ----+------ (0 rows) =\u0026gt; select * from frn_tab; id ---- (0 rows) Since the foreign key table depends on the primary table\u0026rsquo;s data, you cannot directly truncate the primary table — you must add CASCADE, at which point the foreign key table is also cleared along with the primary table.\n4. RESTRICT Whether to clear foreign key tables. Not very useful — it\u0026rsquo;s the default option, and behavior is the same whether specified or not. 
Use CASCADE to clear associated foreign key tables.\nMVCC / Transaction # The PG official documentation has this passage:\nTRUNCATE is not MVCC-safe. After truncation, the table will appear empty to concurrent transactions, if they are using a snapshot taken before the truncation occurred. TRUNCATE is transaction-safe with respect to the data in the tables: the truncation will be safely rolled back if the surrounding transaction does not commit.\ntransaction-safe means it can be placed inside a transaction block and can be rolled back. Rolling back truncate:\n=\u0026gt; begin; BEGIN =\u0026gt; truncate t1; TRUNCATE TABLE =\u0026gt; rollback; ROLLBACK =\u0026gt; select count(*) from t1; count ------- 100 not MVCC-safe means: if a session takes a snapshot before truncate, and a truncate occurs during the snapshot period, that snapshot can read the result after truncate. This does not conform to MVCC. However, this isn\u0026rsquo;t a big issue in session scenarios, because truncate takes an 8-level lock (AccessExclusiveLock). If the snapshot hasn\u0026rsquo;t ended, at minimum there\u0026rsquo;s a read shared lock on the table, so truncate won\u0026rsquo;t execute.\nThis will only be an issue for a transaction that did not access the table in question before the DDL command started — any transaction that has done so would hold at least an ACCESS SHARE table lock, which would block the DDL command until that transaction completes.\nFeature Updates # There aren\u0026rsquo;t many truncate feature updates. Just note that PG14 added support for truncating foreign tables. The prerequisite for truncating foreign tables is that the FDW must support the TRUNCATE API.\nAlso it extends postgres_fdw so that it can issue TRUNCATE command to foreign servers, by adding new routine for that TRUNCATE API.\nFunctional Differences Between pg TRUNCATE and Other Databases # TRUNCATE being fast and an 8-level lock are already well-known traits. 
Compared to other databases, PG can also: choose whether to reset sequences (RESTART IDENTITY CONTINUE IDENTITY), rollback, and has simple authorization.\nWhat TRUNCATE Does # create table lzl(a int); create index lzl_idx on lzl(a); create sequence lzl_seq start with 1; alter table lzl alter column a set default nextval(\u0026#39;lzl_seq\u0026#39;); --select pg_relation_filepath(\u0026#39;lzl\u0026#39;); -- db path =\u0026gt; select oid from pg_database where datname=\u0026#39;lzldb\u0026#39;; oid -------- 418679 -- When first created, each rel\u0026#39;s oid = relfilenode =\u0026gt; select relname,oid,relfilenode,relkind from pg_class where relname like \u0026#39;lzl%\u0026#39;; relname | oid | relfilenode | relkind ---------+--------+-------------+--------- lzl | 428363 | 428363 | r lzl_idx | 428366 | 428366 | i lzl_seq | 428367 | 428367 | S (3 rows) =\u0026gt; truncate table lzl; TRUNCATE TABLE =\u0026gt; select relname,oid,relfilenode,relkind from pg_class where relname like \u0026#39;lzl%\u0026#39;; relname | oid | relfilenode | relkind ---------+--------+-------------+--------- lzl | 428363 | 428370 | r lzl_idx | 428366 | 428371 | i lzl_seq | 428367 | 428367 | S -- After truncate, table and index were rebuilt, but sequence was not M=\u0026gt; truncate table lzl RESTART IDENTITY; TRUNCATE TABLE =\u0026gt; select relname,oid,relfilenode,relkind from pg_class where relname like \u0026#39;lzl%\u0026#39;; relname | oid | relfilenode | relkind ---------+--------+-------------+--------- lzl | 428363 | 428372 | r lzl_idx | 428366 | 428373 | i lzl_seq | 428367 | 428367 | S -- Even with explicit RESTART, sequence was still not rebuilt M=\u0026gt; alter sequence lzl_seq restart; ALTER SEQUENCE M=\u0026gt; select relname,oid,relfilenode,relkind from pg_class where relname like \u0026#39;lzl%\u0026#39;; relname | oid | relfilenode | relkind ---------+--------+-------------+--------- lzl | 428363 | 428372 | r lzl_idx | 428366 | 428373 | i lzl_seq | 428367 | 428374 | S -- 
Explicitly restarting the sequence DOES rebuild it truncate ... RESTART IDENTITY did not rebuild our sequence, while alter sequence lzl_seq restart did rebuild the sequence. It seems the understanding of RESTART IDENTITY was wrong. Let\u0026rsquo;s look at the official documentation for RESTART IDENTITY:\nAutomatically restart sequences owned by columns of the truncated table(s).\nThe sequence must be owned by a column on the table — note: not owner to. Although \\d shows sequences on the table, they may not belong to the table.\n\\d+ lzl; Table \u0026#34;public.lzl\u0026#34; Column | Type | Collation | Nullable | Default | Storage | Stats target | Description --------+---------+-----------+----------+------------------------------+---------+--------------+------------- a | integer | | | nextval(\u0026#39;lzl_seq\u0026#39;::regclass) | plain | | Use owned by to modify the sequence\u0026rsquo;s owning table:\n=\u0026gt; ALTER SEQUENCE lzl_seq OWNED BY lzl.a; ALTER SEQUENCE -- Check sequence owner information SELECT s.relname AS seq, n.nspname AS sch, t.relname AS tab, a.attname AS col FROM pg_class s JOIN pg_depend d ON d.objid=s.oid AND d.classid=\u0026#39;pg_class\u0026#39;::regclass AND d.refclassid=\u0026#39;pg_class\u0026#39;::regclass JOIN pg_class t ON t.oid=d.refobjid JOIN pg_namespace n ON n.oid=t.relnamespace JOIN pg_attribute a ON a.attrelid=t.oid AND a.attnum=d.refobjsubid WHERE s.relkind=\u0026#39;S\u0026#39; AND d.deptype=\u0026#39;a\u0026#39;; seq | sch | tab | col -------------------+--------+-------------+----- tableserial_a_seq | public | tableserial | a lzl_seq | public | lzl | a =\u0026gt; truncate table lzl RESTART IDENTITY; TRUNCATE TABLE M=\u0026gt; select relname,oid,relfilenode,relkind from pg_class where relname like \u0026#39;lzl%\u0026#39;; relname | oid | relfilenode | relkind ---------+--------+-------------+--------- lzl | 428363 | 428375 | r lzl_idx | 428366 | 428376 | i lzl_seq | 428367 | 428377 | S When a sequence is owned by a 
column on the table, explicitly specifying RESTART IDENTITY with truncate will restart that sequence, which also rebuilds the sequence. Sequences created via serial/bigserial are owned by the table and are dropped when the table is dropped; sequences not owned by a table are not dropped when the table is dropped.\n\nSummary of truncate rebuild characteristics:\nDirect truncate table rebuilds the table and indexes. truncate table + RESTART IDENTITY rebuilds (i.e., restarts) sequences that belong to this table. If a sequence doesn\u0026rsquo;t belong to this table, even if the column\u0026rsquo;s default is associated with the sequence, the sequence won\u0026rsquo;t be rebuilt. Source Code Analysis # TRUNCATE is also a utility command, and the entry function can be found quickly.\nExecuteTruncate in src/backend/commands/tablecmds.c is the entry function. The comments already explain that truncate must acquire an exclusive lock, check permissions and relation validity, and recursively check all tables that need to be truncated.\nvoid ExecuteTruncate(TruncateStmt *stmt) { ... /* * Open, exclusive-lock, and check all the explicitly-specified relations */ foreach(cell, stmt-\u0026gt;relations) { ... LOCKMODE\tlockmode = AccessExclusiveLock; // Level 8 lock ... /* open the relation, we already hold a lock on it */ rel = table_open(myrelid, NoLock); // Open table ... truncate_check_activity(rel); // Even with the lock, verify it\u0026#39;s not in use ... if (recurse) // Recursive execution { ... children = find_all_inheritors(myrelid, lockmode, NULL); // Find all inheritance children foreach(child, children) { ... 
// Above only checked the parent table, recursion checks children truncate_check_rel(RelationGetRelid(rel), rel-\u0026gt;rd_rel); truncate_check_activity(rel); rels = lappend(rels, rel); // Add to the list of rels to truncate relids = lappend_oid(relids, childrelid); ... } } // Recursion ends // truncate only on partitioned parent table? error directly else if (rel-\u0026gt;rd_rel-\u0026gt;relkind == RELKIND_PARTITIONED_TABLE) ereport(ERROR, (errcode(ERRCODE_WRONG_OBJECT_TYPE), errmsg(\u0026#34;cannot truncate only a partitioned table\u0026#34;), errhint(\u0026#34;Do not specify the ONLY keyword, or use TRUNCATE ONLY on the partitions directly.\u0026#34;))); } // Main function ExecuteTruncateGuts(rels, relids, relids_logged, stmt-\u0026gt;behavior, stmt-\u0026gt;restart_seqs); /* And close the rels */ foreach(cell, rels) { Relation\trel = (Relation) lfirst(cell); table_close(rel, NoLock); } } ExecuteTruncateGuts is called not only by the TRUNCATE command but also by the subscription side (publication/subscription can synchronize TRUNCATE).\nvoid ExecuteTruncateGuts(List *explicit_rels, List *relids, List *relids_logged, DropBehavior behavior, bool restart_seqs) { ... rels = list_copy(explicit_rels); if (behavior == DROP_CASCADE) // If CASCADE option specified, extract all referencing relations { for (;;) { ... newrelids = heap_truncate_find_FKs(relids); // Find FKs if (newrelids == NIL) break;\t/* nothing else to add */ // No rels, exit directly foreach(cell, newrelids) { ... rel = table_open(relid, AccessExclusiveLock); // All rels acquire AccessExclusiveLock ereport(NOTICE, (errmsg(\u0026#34;truncate cascades to table \\\u0026#34;%s\\\u0026#34;\u0026#34;, RelationGetRelationName(rel)))); truncate_check_rel(relid, rel-\u0026gt;rd_rel); // Check if it\u0026#39;s a truncatable object — must be a data-storing table truncate_check_perms(relid, rel-\u0026gt;rd_rel); // Check permissions truncate_check_activity(rel); // Check if in use ... } } } ... 
if (restart_seqs) // Handle restart seq { foreach(cell, rels) { Relation\trel = (Relation) lfirst(cell); List\t*seqlist = getOwnedSequences(RelationGetRelid(rel)); ... // Only check sequence permissions if (!pg_class_ownercheck(seq_relid, GetUserId())) aclcheck_error(ACLCHECK_NOT_OWNER, OBJECT_SEQUENCE, RelationGetRelationName(seq_rel)); ... } } ... // Execute all BEFORE TRUNCATE triggers foreach(cell, rels) { ExecBSTruncateTriggers(estate, resultRelInfo); resultRelInfo++; } // Begin the actual truncate foreach(cell, rels) { ... // If it\u0026#39;s a partitioned parent table, do nothing if (rel-\u0026gt;rd_rel-\u0026gt;relkind == RELKIND_PARTITIONED_TABLE) continue; // Handle foreign tables if (rel-\u0026gt;rd_rel-\u0026gt;relkind == RELKIND_FOREIGN_TABLE) { ... } ... // If same transaction (may rollback), directly execute heap_truncate_one_rel without creating new relfilenode if (rel-\u0026gt;rd_createSubid == mySubid || rel-\u0026gt;rd_newRelfilenodeSubid == mySubid) { /* Immediate, non-rollbackable truncation is OK */ heap_truncate_one_rel(rel); } else { ... // Set NewRelfilenode RelationSetNewRelfilenode(rel, rel-\u0026gt;rd_rel-\u0026gt;relpersistence); heap_relid = RelationGetRelid(rel); // Same for toast toast_relid = rel-\u0026gt;rd_rel-\u0026gt;reltoastrelid; if (OidIsValid(toast_relid)) { Relation\ttoastrel = relation_open(toast_relid, AccessExclusiveLock); RelationSetNewRelfilenode(toastrel, toastrel-\u0026gt;rd_rel-\u0026gt;relpersistence); table_close(toastrel, NoLock); } ... // Rebuild indexes reindex_relation(heap_relid, REINDEX_REL_PROCESS_TOAST, \u0026amp;reindex_params); } pgstat_count_truncate(rel); // Update pgstat truncate count } ... // Reset sequences foreach(cell, seq_relids) { Oid\tseq_relid = lfirst_oid(cell); ResetSequence(seq_relid); } // Write WAL if (list_length(relids_logged) \u0026gt; 0) { ... 
} // Fire AFTER TRUNCATE triggers resultRelInfo = resultRelInfos; foreach(cell, rels) { ExecASTruncateTriggers(estate, resultRelInfo); resultRelInfo++; } ... } The ExecuteTruncateGuts function processes according to truncate options, with the following flow:\nFind all referencing foreign key tables based on CASCADE option Fire BEFORE TRUNCATE triggers Execute truncate If same transaction, don\u0026rsquo;t immediately create NewRelfilenode, directly call heap_truncate_one_rel for truncation If not same transaction, call RelationSetNewRelfilenode to create new NewRelfilenode reindex_relation function rebuilds indexes Reset sequences based on RESTART IDENTITY Write WAL log Fire AFTER TRUNCATE triggers Tracing further, there\u0026rsquo;s quite a bit of function nesting: RelationSetNewRelfilenode table_relation_set_new_filenode relation_set_new_filenode heapam_relation_set_new_filenode RelationCreateStorage Then to smgrcreate and smgr_create in src/backend/storage/smgr/smgr.c. The comment for smgr.c:\npublic interface routines to storage manager switch All file system operations in POSTGRES dispatch through these routines.\nAny file system operation goes through smgr (storage manager); at this point it becomes file system operations.\nReference # https://www.postgresql.org/docs/15/sql-truncate.html https://www.postgresql.org/docs/current/mvcc-caveats.html https://pgpedia.info/t/truncate.html https://www.orafaq.com/wiki/SQL_FAQ https://learnsql.com/blog/difference-between-truncate-delete-and-drop-table-in-sql/\n","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/2024/08/12/a-brief-analysis-of-postgresql-truncate/","section":"Posts","summary":"Command Options # TRUNCATE [ TABLE ] [ ONLY ] name [ * ] [, ... ] [ RESTART IDENTITY | CONTINUE IDENTITY ] [ CASCADE | RESTRICT ] 1. ONLY: truncate only the specified table. When a table has inheritance children or child partitions, by default they are truncated together; ONLY can truncate just the inheritance parent table. 
Partitioned parent tables cannot specify ONLY.\n-- Cannot truncate only a partitioned parent table =\u003e truncate only parttable; ERROR: 42809: cannot truncate only a partitioned table HINT: Do not specify the ONLY keyword, or use TRUNCATE ONLY on the partitions directly. LOCATION: ExecuteTruncate, tablecmds.c:1655 -- truncate only the inheritance parent table, only the parent is cleaned =\u003e truncate table only parenttable; TRUNCATE TABLE =\u003e select tableoid::regclass,count(*) from parenttable group by tableoid::regclass ; tableoid | count ------------+------- childtable | 1 -- Directly truncate the inheritance parent table, child tables are also cleaned =\u003e truncate table parenttable; TRUNCATE TABLE =\u003e select tableoid::regclass,count(*) from parenttable group by tableoid::regclass ; tableoid | count ----------+------- (0 rows) 2. RESTART IDENTITY CONTINUE IDENTITY: whether to reset sequences on columns. Default is CONTINUE.\n","title":"A Brief Analysis of PostgreSQL TRUNCATE","type":"posts"},{"content":" Slow Primary Key Update — Problem Analysis # A simple primary key update took over 1 second to execute. Due to high concurrency, the CPU was completely maxed out:\n2024-04-01 10:19:36.084 CST,\u0026#34;lzlopr\u0026#34;,\u0026#34;lzl\u0026#34;,158751,\u0026#34;10.33.78.149:51502\u0026#34;,66055a6b.26c1f,172,\u0026#34;UPDATE\u0026#34;,2024-03-28 19:54:19 CST,528/19816630,970251337,LOG,00000,\u0026#34;duration: 1218.688 ms plan: Query Text: update table_a set (omitted...）=$6 where column_id =$7 Update on table_a (cost=0.40..5.49 rows=1 width=2774) -\u0026gt; Index Scan using pk_id on table_a (cost=0.40..5.49 rows=1 width=2774) Index Cond: ((column_id)::text = $7)\u0026#34;,,,,,,,,,\u0026#34;PostgreSQL JDBC Driver\u0026#34;,\u0026#34;client backend\u0026#34; The SQL itself is very simple — an update with a condition on the primary key. 
Looking at the execution plan, it used the pk_id primary key index, so there was no problem with the plan itself; the issue wasn\u0026rsquo;t a plan change.\nLet\u0026rsquo;s rewrite the SQL (since it\u0026rsquo;s an UPDATE) and use explain (analyze,buffers) to compare the execution cost:\n=\u0026gt; explain (analyze,buffers) select * from table_a where column_id=\u0026#39;d4f713370e584820a9b15e2218ea436a\u0026#39;; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on table_a (cost=2.91..5.42 rows=1 width=1156) (actual time=55.052..123.354 rows=1 loops=1) Recheck Cond: ((column_id)::text = \u0026#39;d4f713370e584820a9b15e2218ea436a\u0026#39;::text) Heap Blocks: exact=1 Buffers: shared hit=13870 -\u0026gt; Bitmap Index Scan on pk_id (cost=0.00..2.91 rows=1 width=0) (actual time=3.464..3.465 rows=13866 loops=1) Index Cond: ((column_id)::text = \u0026#39;d4f713370e584820a9b15e2218ea436a\u0026#39;::text) Buffers: shared hit=24 Planning: Buffers: shared hit=4261 Planning Time: 11.028 ms Execution Time: 123.567 ms (11 rows) The actual execution plan is fine, but shared hit=13870 is clearly way too high. Normally, a primary key lookup shouldn\u0026rsquo;t scan that many pages. This strongly suggests table bloat.\nChecking table bloat:\n-- Table size \\dt Size | 525 MB -- Actual row count count | 827 -- Dead tuples from pg_stat_all_tables n_live_tup | 786 n_dead_tup | 657604 Only ~800 live tuples but 650K dead tuples! This explains why the primary key scan visited so many pages. But why weren\u0026rsquo;t the dead tuples reclaimed?\nWhen a table exceeds the default 20% modification threshold, autovacuum triggers vacuum to reclaim space. 
We can see in the logs that autovacuum was indeed being triggered:\n2024-04-01 14:13:46.649 CST,,,14081,,660a5099.3701,1,,2024-04-01 14:13:45 CST,259/17828993,0,LOG,00000,\u0026#34;automatic vacuum of table \u0026#34;\u0026#34;lzl.public.table_a\u0026#34;\u0026#34;: index scans: 0 2024-04-01 14:13:47.801 CST,,,14081,,660a5099.3701,2,,2024-04-01 14:13:45 CST,259/17828994,971045014,LOG,00000,\u0026#34;automatic analyze of table \u0026#34;\u0026#34;lzl.public.table_a\u0026#34;\u0026#34; system usage: CPU: user: 0.08 s, system: 0.01 s, elapsed: 1.15 s\u0026#34;,,,,,,,,,\u0026#34;\u0026#34;,\u0026#34;autovacuum worker\u0026#34; 2024-04-01 14:14:46.673 CST,,,26136,,660a50d5.6618,1,,2024-04-01 14:14:45 CST,259/17829090,0,LOG,00000,\u0026#34;automatic vacuum of table \u0026#34;\u0026#34;lzl.public.table_a\u0026#34;\u0026#34;: index scans: 0 2024-04-01 14:14:47.833 CST,,,26136,,660a50d5.6618,2,,2024-04-01 14:14:45 CST,259/17829091,971049759,LOG,00000,\u0026#34;automatic analyze of table \u0026#34;\u0026#34;lzl.public.table_a\u0026#34;\u0026#34; system usage: CPU: user: 0.08 s, system: 0.03 s, elapsed: 1.15 s\u0026#34;,,,,,,,,,\u0026#34;\u0026#34;,\u0026#34;autovacuum worker\u0026#34; 2024-04-01 14:15:46.680 CST,,,40743,,660a5111.9f27,1,,2024-04-01 14:15:45 CST,259/17829164,0,LOG,00000,\u0026#34;automatic vacuum of table \u0026#34;\u0026#34;lzl.public.table_a\u0026#34;\u0026#34;: index scans: 0 2024-04-01 14:15:47.849 CST,,,40743,,660a5111.9f27,2,,2024-04-01 14:15:45 CST,259/17829165,971055464,LOG,00000,\u0026#34;automatic analyze of table \u0026#34;\u0026#34;lzl.public.table_a\u0026#34;\u0026#34; system usage: CPU: user: 0.08 s, system: 0.03 s, elapsed: 1.16 s\u0026#34;,,,,,,,,,\u0026#34;\u0026#34;,\u0026#34;autovacuum worker\u0026#34; 2024-04-01 14:16:46.677 CST,,,52599,,660a514d.cd77,1,,2024-04-01 14:16:45 CST,259/17829263,0,LOG,00000,\u0026#34;automatic vacuum of table \u0026#34;\u0026#34;lzl.public.table_a\u0026#34;\u0026#34;: index scans: 0 2024-04-01 14:16:47.844 
CST,,,52599,,660a514d.cd77,2,,2024-04-01 14:16:45 CST,259/17829264,971061382,LOG,00000,\u0026#34;automatic analyze of table \u0026#34;\u0026#34;lzl.public.table_a\u0026#34;\u0026#34; system usage: CPU: user: 0.08 s, system: 0.03 s, elapsed: 1.16 s\u0026#34;,,,,,,,,,\u0026#34;\u0026#34;,\u0026#34;autovacuum worker\u0026#34; 2024-04-01 14:17:46.699 CST,,,64858,,660a5189.fd5a,1,,2024-04-01 14:17:45 CST,234/16589539,0,LOG,00000,\u0026#34;automatic vacuum of table \u0026#34;\u0026#34;lzl.public.table_a\u0026#34;\u0026#34;: index scans: 0 2024-04-01 14:17:47.851 CST,,,64858,,660a5189.fd5a,2,,2024-04-01 14:17:45 CST,234/16589540,971066091,LOG,00000,\u0026#34;automatic analyze of table \u0026#34;\u0026#34;lzl.public.table_a\u0026#34;\u0026#34; system usage: CPU: user: 0.09 s, system: 0.02 s, elapsed: 1.15 s\u0026#34;,,,,,,,,,\u0026#34;\u0026#34;,\u0026#34;autovacuum worker\u0026#34; 2024-04-01 14:18:46.703 CST,,,78112,,660a51c5.13120,1,,2024-04-01 14:18:45 CST,259/17829409,0,LOG,00000,\u0026#34;automatic vacuum of table \u0026#34;\u0026#34;lzl.public.table_a\u0026#34;\u0026#34;: index scans: 0 2024-04-01 14:18:47.854 CST,,,78112,,660a51c5.13120,2,,2024-04-01 14:18:45 CST,259/17829410,971070390,LOG,00000,\u0026#34;automatic analyze of table \u0026#34;\u0026#34;lzl.public.table_a\u0026#34;\u0026#34; system usage: CPU: user: 0.09 s, system: 0.02 s, elapsed: 1.15 s\u0026#34;,,,,,,,,,\u0026#34;\u0026#34;,\u0026#34;autovacuum worker\u0026#34;\tNot only was it triggered, but the interval was exactly 1 minute. 
The default autovacuum_naptime is 1 minute:\n=\u0026gt; show autovacuum_naptime ; autovacuum_naptime -------------------- 1min (1 row) We can conclude:\nautovacuum was successfully triggered Dead tuples either couldn\u0026rsquo;t be reclaimed fast enough — the dead tuples generated within 1 minute exceeded 20% (maybe 1 minute is too long); or they weren\u0026rsquo;t being reclaimed at all, guaranteeing the next autovacuum trigger Let\u0026rsquo;s look at the detailed autovacuum output:\n2024-04-01 10:22:44.648 CST,,,16827,,660a1a73.41bb,1,,2024-04-01 10:22:43 CST,170/16910186,0,LOG,00000,\u0026#34;automatic vacuum of table \u0026#34;\u0026#34;lzl.public.table_a\u0026#34;\u0026#34;: index scans: 0 pages: 0 removed, 48745 remain, 6 skipped due to pins, 0 skipped frozen tuples: 0 removed, 744488 remain, 743666 are dead but not yet removable, oldest xmin: 969118077 buffer usage: 97603 hits, 0 misses, 5 dirtied avg read rate: 0.000 MB/s, avg write rate: 0.028 MB/s system usage: CPU: user: 0.21 s, system: 0.22 s, elapsed: 1.41 s WAL usage: 4 records, 3 full page images, 5129 bytes\u0026#34;,,,,,,,,,\u0026#34;\u0026#34;,\u0026#34;autovacuum worker\u0026#34; autovacuum triggered but reclaimed nothing: tuples: 0 removed, 744488 remain, 743666 are dead but not yet removable, oldest xmin: 969118077. oldest xmin is the oldest transaction ID that some session may still need to see, and tuples that died after it cannot be removed. Here it means there\u0026rsquo;s a long-running transaction. 
This is easy to find:\n=\u0026gt; select pid,usename,xact_start,state_change,wait_event,state,query from pg_stat_activity where state\u0026lt;\u0026gt;\u0026#39;idle\u0026#39; order by xact_start ; pid | usename | xact_start | state_change | wait_event | state | --------+------------+-------------------------------+-------------------------------+---------------------+---------------------+------------------------------------------------------------------------------ 164658 | phbdspsqp | 2024-04-01 08:36:57.275408+08 | 2024-04-01 08:36:57.299609+08 | DataFileRead | active | SELECT \u0026#34;minval\u0026#34;,\u0026#34;maxval\u0026#34; FROM (select min(ID) as minval,max(TRACK The long transaction was a SQL statement that had been running since around 8 AM that morning, for several hours. Even though it touched a different table, it held back the oldest xmin, so it still blocked dead tuple cleanup on table A.\nAt this point the root cause is identified:\nTable A had frequent updates, high bloat risk A long transaction on table B prevented dead tuple reclamation on table A Table A\u0026rsquo;s update statements scanned excessive pages Solution:\nKill the long transaction: select pg_terminate_backend(164658) Manually vacuum or wait 1 minute (or less) for automatic vacuum: vacuum table_a After both steps were completed, checking dead tuples:\nn_live_tup | 707 n_dead_tup | 298 650K dead tuples have been cleaned up.\nChecking the execution plan again:\n=\u0026gt; explain (analyze,buffers) select * from table_a where column_id=\u0026#39;d4f713370e584820a9b15e2218ea436a\u0026#39;; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------- Index Scan using pk_id on table_a (cost=0.40..5.44 rows=1 width=621) (actual time=0.026..0.029 rows=1 loops=1) Index Cond: ((column_id)::text = \u0026#39;d4f713370e584820a9b15e2218ea436a\u0026#39;::text) Buffers: shared hit=6 Planning Time: 0.057 ms Execution Time: 0.043 ms 
Shared hits down to just 6 — issue resolved.\nAdditionally, vacuum only marks the space held by dead tuples as reusable; it generally does not shrink the table, so the table remains the same size. The freed pages are reused by new data, and the space is returned to the OS only through a repack or table rebuild:\nSize | 525 MB Bonus SQL Optimization — ORDER BY LIMIT # That long-running transaction SQL also had its own problems\u0026hellip; The business reported it ran fast a few days ago but took several hours today:\nexplain select min(ID) as minval,max(ID) as maxval from table_b where time_at \u0026gt;= to_timestamp(\u0026#39;2024-03-30 00:00:00\u0026#39;,\u0026#39;yyyy-MM-dd HH24:mi:ss\u0026#39;); QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------------- Result (cost=4298.54..4298.55 rows=1 width=64) InitPlan 1 (returns $0) -\u0026gt; Limit (cost=0.70..2149.27 rows=1 width=32) -\u0026gt; Index Scan using pk_b on table_b (cost=0.70..1181490202.27 rows=549896 width=32) Index Cond: ((ID)::text IS NOT NULL) Filter: (time_at \u0026gt;= to_timestamp(\u0026#39;2024-03-30 00:00:00\u0026#39;::text, \u0026#39;yyyy-MM-dd HH24:mi:ss\u0026#39;::text)) InitPlan 2 (returns $1) -\u0026gt; Limit (cost=0.70..2149.27 rows=1 width=32) -\u0026gt; Index Scan Backward using pk_b on table_b table_b_1 (cost=0.70..1181490202.27 rows=549896 width=32) Index Cond: ((ID)::text IS NOT NULL) Filter: (time_at \u0026gt;= to_timestamp(\u0026#39;2024-03-30 00:00:00\u0026#39;::text, \u0026#39;yyyy-MM-dd HH24:mi:ss\u0026#39;::text)) The SQL is also simple — only one condition on a time column, with decent selectivity. However, this SQL did not use the time_at index but instead used the ID primary key index. This is the classic LIMIT problem: the planner turns min/max into an ordered scan of the primary key index with a LIMIT of one row, filtering on time_at as it goes, and assumes a matching row will turn up early. 
Running ANALYZE is useless here — it\u0026rsquo;s better to rewrite the SQL.\nAfter rewriting, the result came back instantly:\nexplain select min(ID||\u0026#39;\u0026#39;) as minval,max(ID||\u0026#39;\u0026#39;) as maxval from table_b where time_at \u0026gt;= to_timestamp(\u0026#39;2024-03-30 00:00:00\u0026#39;,\u0026#39;yyyy-MM-dd HH24:mi:ss\u0026#39;) QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=1201418.86..1201418.87 rows=1 width=64) -\u0026gt; Index Scan using idx_time_at on table_b (cost=0.57..1195919.90 rows=549896 width=33) Index Cond: (time_at \u0026gt;= to_timestamp(\u0026#39;2024-03-30 00:00:00\u0026#39;::text, \u0026#39;yyyy-MM-dd HH24:mi:ss\u0026#39;::text)) This isn\u0026rsquo;t really an execution plan regression, because the plan didn\u0026rsquo;t change. A few days ago it had the same plan but ran fast — the reason is tied to data distribution and the LIMIT mechanism: when data is quickly found, it returns immediately (which is why the optimizer chose the primary key index); when it\u0026rsquo;s \u0026ldquo;unlucky\u0026rdquo; and the matching data is far away, it takes a very long time.\nSummary # A classic case:\nA small table with frequent updates A long transaction preventing dead tuple reclamation The long transaction itself was caused by an index selection problem due to sorting and LIMIT operations (ORDER BY, MAX/MIN, GROUP can all trigger this) One incident, three classic PostgreSQL knowledge points — quite representative.\n","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/2024/08/12/a-classic-case-of-long-transaction-table-bloat-and-limit-issues/","section":"Posts","summary":"Slow Primary Key Update — Problem Analysis # A simple primary key update took over 1 second to execute. 
Due to high concurrency, the CPU was completely maxed out:\n2024-04-01 10:19:36.084 CST,\"lzlopr\",\"lzl\",158751,\"10.33.78.149:51502\",66055a6b.26c1f,172,\"UPDATE\",2024-03-28 19:54:19 CST,528/19816630,970251337,LOG,00000,\"duration: 1218.688 ms plan: Query Text: update table_a set (omitted...）=$6 where column_id =$7 Update on table_a (cost=0.40..5.49 rows=1 width=2774) -\u003e Index Scan using pk_id on table_a (cost=0.40..5.49 rows=1 width=2774) Index Cond: ((column_id)::text = $7)\",,,,,,,,,\"PostgreSQL JDBC Driver\",\"client backend\" The SQL itself is very simple — an update with a condition on the primary key. Looking at the execution plan, it used the pk_id primary key index, so there was no problem with the plan itself; the issue wasn’t a plan change.\n","title":"A Classic Case of Long Transaction, Table Bloat, and LIMIT Issues","type":"posts"},{"content":"PostgreSQL Transactions\nTo guarantee ACID properties, an RDBMS must implement concurrency control. PostgreSQL, like Oracle and MySQL (InnoDB), uses MVCC (Multi-Version Concurrency Control) for concurrency control. MVCC works by continuously generating new versions of objects as data changes while allowing queries to access a bounded range of older versions. It captures a snapshot of data at a given point in time and selects one version to read.\nOracle and MySQL both use undo segments to record old versions of objects. PostgreSQL has no undo. Instead, during DML operations it writes historical data directly into the original table (UPDATE creates a new row, DELETE marks the row) and records additional columns — xmin and xmax — in the table to store transaction IDs. By comparing transaction IDs and other metadata, PostgreSQL implements its MVCC mechanism.\nAmong relational databases, PostgreSQL\u0026rsquo;s transaction mechanism is truly distinctive. 
Understanding it is key to grasping how PostgreSQL operates under the hood.\nTransaction Isolation Levels # Most relational databases support multiple transaction isolation levels. Under different isolation levels, concurrent transaction behavior varies.\nSetting the Transaction Isolation Level # PostgreSQL supports four isolation levels (though only three are actually effective):\n{ SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED } Isolation level parameters\ndefault_transaction_isolation: sets the default isolation level for all transactions globally.\ntransaction_isolation: displays the isolation level of the current session.\nThe default isolation level is read committed.\nChanging the global default isolation level\nModify the default_transaction_isolation parameter and reload:\npostgres=# alter system set default_transaction_isolation to \u0026#39;serializable\u0026#39;; ALTER SYSTEM postgres=# select pg_reload_conf(); pg_reload_conf ---------------- t (1 row) postgres=# show transaction_isolation; transaction_isolation ----------------------- serializable After the change, every new transaction will use the default_transaction_isolation isolation level.\nSetting the session isolation level\nNote: transaction_isolation only displays the current session\u0026rsquo;s isolation level. This parameter cannot be modified directly.\nlzldb=# alter system set transaction_isolation to \u0026#39;REPEATABLE READ\u0026#39;; ERROR: parameter \u0026#34;transaction_isolation\u0026#34; cannot be changed Use SET SESSION to change the session\u0026rsquo;s isolation level:\nlzldb=# SET SESSION CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL REPEATABLE READ; SET lzldb=# show transaction_isolation ; -[ RECORD 1 ]---------+---------------- transaction_isolation | repeatable read Setting the transaction-level isolation level\nPostgreSQL allows specifying the isolation level for an individual transaction. 
You can set it when starting the transaction:\nlzldb=# BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ; BEGIN lzldb=# start TRANSACTION ISOLATION LEVEL REPEATABLE READ; START TRANSACTION Or use set transaction after starting a transaction:\nlzldb=# begin; BEGIN lzldb=*# set transaction ISOLATION LEVEL REPEATABLE READ; SET ANSI-92 Transaction Isolation Levels # The ANSI SQL-92 standard defines four isolation levels:\nSerializable\nAll transactions in the system execute serially, without interfering with each other. Executing transactions one after another avoids all data inconsistency scenarios.\nEarly implementations used exclusive locks to control concurrent transactions. Serial execution caused queuing and dramatically reduced system concurrency. After ANSI-92, more serializable implementation methods emerged, greatly improving both concurrency and performance.\nRepeatable Read\nOnce a transaction begins, all data read during the transaction cannot be modified by other transactions. Repeatable Read is MySQL\u0026rsquo;s default isolation level.\nNote: in ANSI SQL, Repeatable Read can experience phantom reads, but PostgreSQL\u0026rsquo;s Repeatable Read does not.\nRead Committed\nA transaction can read data committed by other transactions. If a transaction reads a piece of data multiple times and that data happens to be modified and committed by another transaction in between, the current transaction will see different values for the same data. This is the default isolation level for both Oracle and PostgreSQL.\nAt this isolation level, both \u0026ldquo;non-repeatable read\u0026rdquo; and \u0026ldquo;phantom read\u0026rdquo; scenarios can occur.\nRead Uncommitted\nA transaction can read data that has been modified but not yet committed by other transactions. 
Since uncommitted data can still be rolled back, reading such data leads to \u0026ldquo;dirty reads.\u0026rdquo;\nAt this isolation level, \u0026ldquo;dirty read\u0026rdquo; scenarios can occur.\nPostgreSQL does not have a Read Uncommitted isolation level. Setting Read Uncommitted is treated as Read Committed.\nStandard concurrency phenomena and isolation level matrix\nIsolation Level Dirty Read Non-repeatable Read Phantom Read Read Uncommitted Possible Possible Possible Read Committed Impossible Possible Possible Repeatable Read Impossible Impossible Possible Serializable Impossible Impossible Impossible PostgreSQL concurrency phenomena and isolation level matrix\nIsolation Level Dirty Read Non-repeatable Read Phantom Read Read Uncommitted Impossible Possible Possible Read Committed Impossible Possible Possible Repeatable Read Impossible Impossible Impossible Serializable Impossible Impossible Impossible A Brief History of Transaction Isolation Levels # The isolation levels and anomaly phenomena defined by ANSI SQL-92 have had a profound impact on the database industry. Even today, over 30 years later, most engineers\u0026rsquo; understanding of transaction isolation levels still revolves around them, and many real-world database isolation level implementations still follow them. However, the post-ANSI-92 era has seen much discussion and even criticism regarding isolation levels. Here is a summary of the key historical developments:\n1992: The database industry was in a chaotic state regarding transactions, so ANSI defined the SQL-92 standard — the widely known 4 isolation levels and 4 anomaly phenomena.\n1995: Snapshot Isolation and other isolation levels were proposed, along with more anomaly phenomena. Microsoft engineers proposed the Snapshot Isolation level and criticized ANSI SQL-92, noting that the standard was vaguely defined and many isolation levels and anomalies were left undefined. See A Critique of ANSI SQL Isolation Levels. 
By this point, there were more than 4 isolation levels and more anomaly phenomena, including write skew.\n1999: Due to the proliferation of lock-based isolation levels, Atul Adya\u0026rsquo;s paper organized these phenomena and mapped the various isolation levels back to ANSI SQL-92 based on anomaly phenomena and functionality.\n2005: Because most databases claimed to be serializable but actually provided Snapshot Isolation, Alan Fekete et al. proposed Making Snapshot Isolation Serializable, which achieves serializability on top of Snapshot Isolation by eliminating its anomalies.\n2008: Fekete extended serializability and proposed a database-level implementation called Serializable Snapshot Isolation (SSI).\n2012: PostgreSQL became the first database to implement SSI. See the PostgreSQL SSI implementation paper.\nIsolation levels and anomaly phenomena from the 1995 Critique of ANSI SQL Isolation Levels:\nIsolation Levels Supported by Various Databases # Many databases claim \u0026ldquo;full ACID\u0026rdquo; compliance, but without serializability, ACID cannot be fully realized (especially consistency). The truth is that most do not fully implement it, including the veteran Oracle.\nSerializable # There are many misconceptions about serializability.\nThe meaning of serializable: if each transaction is itself correct (satisfying certain integrity conditions), then any schedule that executes those transactions serially is also correct (the transactions still satisfy their conditions). \u0026ldquo;Serial\u0026rdquo; means transactions do not overlap in time and cannot interfere with each other; they are fully isolated.\nIn the 1970s, serializability was achieved through Strong Strict Two-Phase Locking (SS2PL), where reads and writes block each other until the transaction ends. 
SS2PL sacrifices concurrency but eliminates anomaly phenomena.\nBeyond SS2PL, there are other ways to achieve serializability, such as Serializable Snapshot Isolation (SSI).\nTo guarantee no anomalies, serializability sacrifices some concurrency (how much depends on the implementation), but it can truly guarantee data consistency (the \u0026ldquo;C\u0026rdquo; in ACID). In other words, databases that do not implement serializability do not fully support ACID.\nSerializability has been mathematically proven achievable, but the real database world is somewhat \u0026ldquo;abnormal.\u0026rdquo; In practice, serializability is the highest transaction isolation level and the one strongly recommended by academics and experts. However, the vast majority of databases run at Read Committed or Snapshot Isolation.\nWhy Do Weaker Isolation Levels Cause Academic Problems but Few Real-World Disasters? # Anomalies in non-serializable isolation levels generally require high concurrency. Low-concurrency databases rarely encounter problems.\nWhen anomalies do occur, some applications may not detect them or may not consider them important.\nIt is possible that data becomes anomalous but the application simply returns an error and enters exception-handling logic.\nThe cost is too high. Not only is the development cost of serializable isolation high for the database, but applications also need to adapt. Simply understanding this complex theory is no easy task.\nHigher isolation levels lose some performance. Extensive rework may not be worth it; applications must choose between \u0026ldquo;high concurrency\u0026rdquo; and \u0026ldquo;freedom from anomalies.\u0026rdquo;\nBusiness logic is built around mechanisms, not rules. Applications have somewhat adapted to the anomalies of weaker isolation levels, especially Read Committed or Snapshot Isolation.\nSnapshot Isolation # ANSI SQL-92 did not define Snapshot Isolation (SI). 
This isolation level emerged as the database industry evolved.\nQuoting the Wikipedia definition: a transaction executing under Snapshot Isolation operates on a snapshot of the database taken at the start of the transaction. When the transaction ends, it will only commit successfully if the values it updated have not been externally changed since the snapshot was taken. Write conflicts thus cause transaction aborts.\nAs the name implies, Snapshot Isolation uses snapshots. It exists in databases that use MVCC, where the multi-version concurrency mechanism supports concurrent transaction execution.\nThe 1992 ANSI SQL-92 standard was defined based on database locks, so it did not define Snapshot Isolation. The concept only emerged with the 1995 Critique.\nSerializable Snapshot Isolation # Due to the widespread adoption of Snapshot Isolation and the academic goal that databases should achieve serializability, Serializable Snapshot Isolation (SSI) was born. As the name suggests, it achieves serializability on top of Snapshot Isolation.\nBecause of the ambiguity of the ANSI-92 standard, although Snapshot Isolation was not defined, many databases actually use it. Snapshot Isolation also has certain anomaly phenomena (including write skew), and SSI was created to resolve them.\nMainstream databases implement concurrency control via S2PL or MVCC. Under S2PL, write operations block reads and writes from other transactions, so there is no write skew. MVCC, however, allows reads and writes not to block each other — only write-write conflicts. In concurrent read-write patterns, this leads to write skew. Starting from PostgreSQL 9.1, SSI has been embedded into Snapshot Isolation (PostgreSQL only has Snapshot Isolation, even at the serializable level), resolving write skew and other anomalies.\nWrite Skew # When certain conflicts form a cycle, serialization anomalies occur. 
One of the easier ones to understand is write skew.\nWrite skew only happens in read-write patterns (not write-write or write-read), and only under concurrent conditions. A dependency cycle forms when a preceding transaction\u0026rsquo;s write depends on a later transaction\u0026rsquo;s write.\nThere are many real-world cases of write skew. Let\u0026rsquo;s understand it through the classic black-and-white ball problem:\nA bag contains 10 balls: 5 white and 5 black. Two transactions, P and Q, are running. P changes all black balls to white; Q changes all white balls to black. There are two possible serial executions: P then Q, or Q then P. In both cases, the final result is either 10 white balls or 10 black balls. However, Snapshot Isolation allows another outcome:\nTransaction P picks up 5 black balls Transaction Q picks up 5 white balls Transaction P changes all the balls in hand to white and puts them back Transaction Q changes all the balls in hand to black and puts them back Now the bag still has 5 black and 5 white balls — an outcome impossible in any serial execution. Yet this is valid under Snapshot Isolation: each transaction maintains a consistent view of the database, and its write set does not overlap with any concurrent transaction\u0026rsquo;s write set. Hence, the black and white balls are swapped.\nThe black-and-white ball problem illustrates: the result under Snapshot Isolation is inconsistent with the result under serial execution. Write skew occurs under Snapshot Isolation, and the data outcome does not match expectations.\nSSI in PostgreSQL # PostgreSQL was the first database to implement SSI. 
Here is the black-and-white ball example using the Wikipedia code:\ncreate table dots ( id int not null primary key, color text not null ); insert into dots with x(id) as (select generate_series(1,10)) select id, case when id % 2 = 1 then \u0026#39;black\u0026#39; else \u0026#39;white\u0026#39; end from x; set default_transaction_isolation = \u0026lsquo;serializable\u0026rsquo;; set default_transaction_isolation = \u0026lsquo;serializable\u0026rsquo;; begin; update dots set color = \u0026lsquo;black\u0026rsquo; where color = \u0026lsquo;white\u0026rsquo;; begin; update dots set color = \u0026lsquo;white\u0026rsquo; where color = \u0026lsquo;black\u0026rsquo;; commit commit (PostgreSQL SSI: first committer succeeds, second throws an error) ERROR: could not serialize access due to read/write dependencies among transactions DETAIL: Reason code: Canceled on identification as a pivot, during commit attempt. HINT: The transaction might succeed if retried. (At Read Committed and Repeatable Read, no error is thrown; the black and white balls simply swap colors. Test results omitted.)\nStrict Two-Phase Locking (S2PL) can also achieve serializability, but S2PL requires heavy read-write locks held until transaction commit. S2PL severely impacts concurrency performance, and users generally won\u0026rsquo;t accept reads and writes blocking each other, so PostgreSQL does not use S2PL.\nSSI is an alternative approach to serializability. It still uses Snapshot Isolation but additionally checks for anomaly phenomena. The two approaches also handle anomalies differently: when one occurs, S2PL blocks transactions, while SSI aborts a transaction to break the cycle.\nOne reason people avoid serializability is that it supposedly reduces database performance. This is understandable — SSI, which performs \u0026ldquo;anomaly checks,\u0026rdquo; must be slower than weaker isolation levels that do no such checking. 
However, with advances in SSI implementation theory and PostgreSQL\u0026rsquo;s optimizations for read-only transactions, SSI\u0026rsquo;s performance is now on par with SI.\nSerializability greatly simplifies applications\u0026rsquo; consistency concerns. PostgreSQL 9.1 has implemented SSI with optimizations. Let\u0026rsquo;s hope applications will one day truly adopt the serializable isolation level.\nTransaction Isolation Level References # https://wiki.postgresql.org/wiki/SSI\nhttps://en.wikipedia.org/wiki/Serializability\nhttps://en.wikipedia.org/wiki/Snapshot_isolation\nhttps://justinjaffray.com/what-does-write-skew-look-like/\nhttp://www.bailis.org/blog/when-is-acid-acid-rarely/\nhttps://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-95-51.pdf — 1995 paper on SI isolation levels and critique of SQL-92\nhttps://www.cse.iitb.ac.in/infolab/Data/Courses/CS632/2009/Papers/p492-fekete.pdf — SSI paper\nhttps://drkp.net/papers/ssi-vldb12.pdf — PostgreSQL SSI implementation\nhttps://ristret.com/s/f643zk/history_transaction_histories — History of transaction isolation levels\nTransaction Processing # Transaction Blocks # Transactions can be implicit or explicit. An implicit transaction is a standalone SQL statement that auto-commits upon completion. 
An explicit transaction requires an explicit declaration; multiple SQL statements grouped together form a transaction block.\nTransaction blocks begin with begin, begin transaction, or start transaction.\nThey end with COMMIT, END, or ABORT, ROLLBACK, where COMMIT=END and ABORT=ROLLBACK.\nBEGIN; select * from lzl1 limit 1; update lzl1 set a=2; END; If an error occurs during a transaction block, the transaction can only be rolled back due to atomicity:\nlzldb=# begin; BEGIN lzldb=*# select * from lzl2; ERROR: relation \u0026#34;lzl2\u0026#34; does not exist LINE 1: select * from lzl2; ^ lzldb=!# commit; ROLLBACK Transaction Processing Functions # Transaction processing functions are organized into three layers: top-level transaction functions, middle-level transaction functions, and bottom-level transaction functions.\nTop-level transaction functions handle transaction block commands like BEGIN, COMMIT, ROLLBACK, SAVEPOINT, etc.:\nBeginTransactionBlock Start a transaction block EndTransactionBlock End a transaction block UserAbortTransactionBlock User-initiated transaction abort DefineSavepoint Create a savepoint RollbackToSavepoint Roll back to a savepoint ReleaseSavepoint Release a savepoint Middle-level transaction functions: every SQL statement calls middle-level functions before and after execution, including after detecting an exception:\nStartTransactionCommand Start a transaction command CommitTransactionCommand Complete a transaction command (not commit) AbortCurrentTransaction Abort the current transaction Bottom-level transaction functions: the actual transaction processing functions, responsible for maintaining transaction state, allocating and reclaiming transaction resources, etc.:\nStartTransaction Start a transaction CommitTransaction Commit a transaction AbortTransaction Rollback/abort a transaction CleanupTransaction Clean up a transaction StartSubTransaction Start a subtransaction CommitSubTransaction Commit a subtransaction AbortSubTransaction 
Rollback/abort a subtransaction CleanupSubTransaction Clean up a subtransaction These functions are fairly easy to distinguish. Aside from a few special functions (top-level savepoint-related, middle-level abort function), the three layers are organized as: *Block (transaction block functions), *Command (command functions), and *Transaction (actual transaction processing functions). Savepoints/subtransactions are treated as transaction-block-level functions (subtransactions can be rolled back within a transaction block, so placing them at the block level makes sense), and abort is treated as a command-level function.\nTransaction Block States # Top-level and middle-level functions jointly control the transaction block state; bottom-level functions control the transaction state.\nBoth transaction block states and transaction states are in src/backend/access/transam/xact.c:\ntypedef enum TBlockState { /* states not in a transaction block */ TBLOCK_DEFAULT, /* idle state; entering or exiting a transaction returns to this state */ TBLOCK_STARTED, /* just entered a transaction block; transitions from TBLOCK_DEFAULT; short-lived */ /* transaction block states */ TBLOCK_BEGIN, /* start a transaction block; at this point data block is started, entering block-level state */ TBLOCK_INPROGRESS, /* active transaction; after BEGIN, the block stays in this state until transaction ends */ TBLOCK_IMPLICIT_INPROGRESS, /* active transaction with an implicit BEGIN */ TBLOCK_PARALLEL_INPROGRESS, /* active transaction in parallel execution */ TBLOCK_END, /* received COMMIT command */ TBLOCK_ABORT, /* transaction failed, waiting for ROLLBACK */ TBLOCK_ABORT_END, /* transaction failed, received ROLLBACK */ TBLOCK_ABORT_PENDING, /* active transaction, received ROLLBACK */ TBLOCK_PREPARE, /* active transaction, received PREPARE (explicit 2PC) */ /* subtransaction states (still transaction-block level) */ TBLOCK_SUBBEGIN, /* start a subtransaction */ TBLOCK_SUBINPROGRESS, /* active 
subtransaction */ TBLOCK_SUBRELEASE, /* received RELEASE (release savepoint) */ TBLOCK_SUBCOMMIT, /* parent transaction COMMIT while subtransaction is still running (SUBINPROGRESS) */ TBLOCK_SUBABORT, /* failed subtransaction, waiting for rollback command */ TBLOCK_SUBABORT_END, /* failed subtransaction, received rollback command */ TBLOCK_SUBABORT_PENDING, /* active subtransaction, received rollback command */ TBLOCK_SUBRESTART, /* active subtransaction, received rollback to command */ TBLOCK_SUBABORT_RESTART /* failed subtransaction, received ROLLBACK TO command */ } TBlockState; Most states are self-explanatory. A note on rollback vs. abort: their subsequent behavior is similar — both need to clean up transaction resources and exit the current transaction. Yet PostgreSQL separates them into two behaviors with two states: TBLOCK_ABORT and TBLOCK_ABORT_END (and similarly for subtransactions). Why?\nsrc/backend/access/transam/README offers a detailed explanation:\nScenario 1 Scenario 2 1) User types BEGIN 1) User types BEGIN 2) User executes some commands 2) User executes some commands 3) User doesn\u0026rsquo;t like what she sees, types ABORT 3) The transaction system aborts for some reason (syntax error, etc.) In Scenario 1, we want to abort the transaction and return to the default state.\nIn Scenario 2, more commands may follow that are still part of the current transaction block. We must ignore these commands until we see COMMIT or ROLLBACK.\nAbortCurrentTransaction handles internal transaction aborts; UserAbortTransactionBlock handles user-initiated aborts. Both rely on AbortTransaction to do all the real work. The only difference is what state we enter after AbortTransaction finishes:\n* AbortCurrentTransaction leaves us in TBLOCK_ABORT\n* UserAbortTransactionBlock leaves us in TBLOCK_ABORT_END\nBottom-level transaction abort processing has two phases:\n* As soon as we realize the transaction has failed, AbortTransaction is executed. 
This should release all shared resources (locks, etc.) to avoid unnecessarily increasing latency for other backends.\n* When we finally see the user\u0026rsquo;s COMMIT or ROLLBACK, CleanupTransaction is executed; this function cleans up resources and gets us completely out of the transaction. In particular, we cannot destroy TopTransactionContext before this point.\nTransaction States # Transaction states are straightforward (note: these are different from transaction block states):\ntypedef enum TransState { TRANS_DEFAULT, /* idle */ TRANS_START, /* transaction started */ TRANS_INPROGRESS, /* active transaction */ TRANS_COMMIT, /* transaction commit */ TRANS_ABORT, /* abort transaction */ TRANS_PREPARE /* prepare transaction (2PC) */ } TransState; Transaction State Flow # Each command in a transaction block calls transaction functions, which in turn transition the transaction and transaction block states.\nLet\u0026rsquo;s use the simplest transaction block as an example (from the README):\n1)BEGIN 2)SELECT * FROM foo 3)INSERT INTO foo VALUES (...) 
4)COMMIT Command call relationships:\n/ StartTransactionCommand; -- middle-level: start transaction command / StartTransaction; -- bottom-level: actually start the transaction 1)\u0026lt; ProcessUtility; -- ProcessUtility handles the BEGIN command \\ BeginTransactionBlock; -- top-level: start transaction block \\ CommitTransactionCommand; -- middle-level: complete command / StartTransactionCommand; -- middle-level: start transaction command 2) / PortalRunSelect; -- execute SELECT statement \\ CommitTransactionCommand; -- middle-level: complete command \\ CommandCounterIncrement; -- middle-level: command counter increment / StartTransactionCommand; -- middle-level: start transaction command 3) / ProcessQuery; -- execute INSERT statement \\ CommitTransactionCommand; -- middle-level: complete command \\ CommandCounterIncrement; -- command counter +1 / StartTransactionCommand; -- middle-level: start transaction command / ProcessUtility; -- ProcessUtility handles COMMIT command 4) \u0026lt; EndTransactionBlock; -- top-level: end transaction block \\ CommitTransactionCommand; -- middle-level: complete command \\ CommitTransaction; -- bottom-level: actually commit the transaction Every command in a transaction block begins with the middle-level StartTransactionCommand and ends with CommitTransactionCommand. Between these two middle-level functions is where the actual command processing occurs. The transaction block state for 2) SELECT and 3) INSERT is TBLOCK_INPROGRESS. The state transitions for BEGIN and COMMIT:\nTransaction Function References # PostgreSQL Internals (book)\nsrc/backend/access/transam/README\nTransaction ID # Every transaction in PostgreSQL is assigned a transaction ID. Transaction IDs come in two forms: virtual transaction IDs and persistent transaction IDs. 
Understanding transaction IDs is crucial for grasping transactions, data visibility, transaction ID wraparound, and more.\nVirtual Transaction ID # Read-only transactions are not assigned a transaction ID — transaction IDs are a precious resource. A simple SELECT, for instance, won\u0026rsquo;t consume one. However, to identify transactions for purposes such as shared locks, a non-persistent transaction ID is needed. This is the virtual transaction ID (VXID).\nVXID consists of two parts: a backend ID and a backend-local counter.\nSource: src/include/storage/lock.h\ntypedef struct { BackendId backendId; /* backendId from PGPROC */ LocalTransactionId localTransactionId; /* lxid from PGPROC */ } VirtualTransactionId; (PGPROC is a structure storing process information; we\u0026rsquo;ll cover it later.)\nYou can see VXID in pg_locks. Querying pg_locks itself is a SQL statement, so it generates a VXID:\nlzldb=# begin; BEGIN lzldb=*# select locktype,virtualxid,virtualtransaction,mode from pg_locks; locktype | virtualxid | virtualtransaction | mode ------------+------------+--------------------+----------------- relation\t| | 4/16 | AccessShareLock virtualxid | 4/16 | 4/16 | ExclusiveLock (2 rows) lzldb=*# savepoint p1; SAVEPOINT lzldb=*# select locktype,virtualxid,virtualtransaction,mode from pg_locks; locktype | virtualxid | virtualtransaction | mode ------------+------------+--------------------+----------------- relation | | 4/16 | AccessShareLock virtualxid | 4/16 | 4/16 | ExclusiveLock lzldb=*# rollback; ROLLBACK lzldb=# select locktype,virtualxid,virtualtransaction,mode from pg_locks; locktype | virtualxid | virtualtransaction | mode ------------+------------+--------------------+----------------- relation | | 4/17 | AccessShareLock virtualxid | 4/17 | 4/17 | ExclusiveLock After \\q (disconnect) and immediately logging back in, the counter continues: 4/19.\nOpening another window gives backendID+1:\nlzldb=# select locktype,virtualxid,virtualtransaction,mode from 
pg_locks; locktype | virtualxid | virtualtransaction | mode ------------+------------+--------------------+----------------- relation | | 5/3 | AccessShareLock virtualxid | 5/3 | 5/3 | ExclusiveLock From these tests we can observe:\nThe VXID\u0026rsquo;s backend ID is not the actual process PID; it\u0026rsquo;s simply an incrementing number. Both the VXID\u0026rsquo;s backend ID and command counter are incrementing. Subtransactions do not have their own VXID; they use the parent transaction\u0026rsquo;s VXID. VXID also has wraparound, but it\u0026rsquo;s not a serious issue since it isn\u0026rsquo;t persisted — after an instance restart, VXID starts counting from scratch. Persistent Transaction ID # 32-bit TransactionId # When a data-modifying transaction begins, the transaction manager assigns it a unique identifier: TransactionId. TransactionId is a 32-bit unsigned integer, capable of storing 2^32 = 4,294,967,296 — about 4.2 billion — transactions. The range of a 32-bit unsigned integer is 0 ~ 2^32 - 1.\nThree special transaction IDs\nsrc/include/access/transam.h defines several special transaction IDs:\n#define InvalidTransactionId ((TransactionId) 0) #define BootstrapTransactionId ((TransactionId) 1) #define FrozenTransactionId ((TransactionId) 2) #define FirstNormalTransactionId ((TransactionId) 3) #define MaxTransactionId ((TransactionId) 0xFFFFFFFF) 0: Invalid TransactionId 1: Bootstrap Transaction ID, used only during database initialization. Older than all normal transactions. 2: Frozen Transaction ID. Older than all normal transactions. #define TransactionIdIsNormal(xid) ((xid) \u0026gt;= FirstNormalTransactionId) A transaction ID \u0026gt;= 3 is a normal transaction ID.\nThe maximum transaction ID, MaxTransactionId, is 0xFFFFFFFF = 4,294,967,295 = 2^32 - 1.\nSo the allocatable range for normal transaction IDs is: 3 ~ 2^32 - 1.\n64-bit FullTransactionId # Transaction IDs increment sequentially. PostgreSQL has used 32-bit transaction IDs for a long time. 
Before PostgreSQL 7.2, when the 32-bit transaction ID was exhausted, you had to dump and restore the database. A 64-bit transaction ID, on the other hand, is practically inexhaustible. The source defines a 64-bit FullTransactionId as a struct:\n/* *A 64-bit value containing an epoch and a TransactionId. *It is wrapped in a struct to prevent implicit conversion to TransactionId. *Not all values represent valid normal XIDs. */ typedef struct FullTransactionId { uint64 value; } FullTransactionId; The 64-bit value consists of an epoch and a 32-bit TransactionId, converted via these functions:\n#define EpochFromFullTransactionId(x)\t((uint32) ((x).value \u0026gt;\u0026gt; 32)) #define XidFromFullTransactionId(x)\t((uint32) (x).value) The epoch is FullTransactionId shifted right 32 bits; the XID (TransactionId) is FullTransactionId modulo 2^32. This is like treating the 32-bit TransactionId as a \u0026ldquo;circle\u0026rdquo; that loops, while the 64-bit FullTransactionId is a \u0026ldquo;line\u0026rdquo; that keeps growing, nearly inexhaustible.\nA full transaction ID can exceed 2^32:\nTransaction ID Assignment # Let\u0026rsquo;s run a few experiments to see how transaction IDs are assigned. We\u0026rsquo;ll use two functions that return transaction IDs:\npg_current_xact_id(): returns the current transaction ID; if the current transaction has not yet been assigned one, it allocates one. (In pg12 and earlier, use txid_current().)\npg_current_xact_id_if_assigned(): returns the current transaction ID; if the current transaction has not yet been assigned one, returns NULL. 
(In pg12 and earlier, use txid_current_if_assigned().)\nTransaction IDs are assigned sequentially:\nlzldb=# select pg_current_xact_id(); pg_current_xact_id -------------------- 612 lzldb=# select pg_current_xact_id(); pg_current_xact_id -------------------- 613 lzldb=# select pg_current_xact_id(); pg_current_xact_id -------------------- 614 BEGIN does not immediately allocate a transaction ID:\nlzldb=# begin; -- explicitly start a transaction BEGIN lzldb=*# select pg_current_xact_id_if_assigned () ; -- BEGIN does not immediately allocate a transaction ID pg_current_xact_id_if_assigned -------------------------------- (1 row) lzldb=*# select * from lzl1; -- query immediately after BEGIN a --- (0 rows) lzldb=*# select pg_current_xact_id_if_assigned () ; -- queries do not allocate transaction IDs pg_current_xact_id_if_assigned -------------------------------- (1 row) lzldb=*# insert into lzl1 values(1); -- insert data, a data change INSERT 0 1 lzldb=*# select pg_current_xact_id_if_assigned () ; -- the first non-query statement after BEGIN allocates a transaction ID pg_current_xact_id_if_assigned -------------------------------- 611 lzldb=*# commit; COMMIT lzldb=# select xmin, pg_current_xact_id_if_assigned () from lzl1; -- the INSERT transaction writes to xmin xmin | pg_current_xact_id_if_assigned ------+-------------------------------- 611 Some records in system catalogs were assigned BootstrapTransactionId=1 during database initialization:\npostgres=# select xmin,count(*) from pg_class where xmin=1 group by xmin; xmin | count ------+------- 1 | 184 Conclusions from the experiments:\nDuring database initialization, the special transaction ID 1 is assigned, visible in system catalogs. Transaction IDs are assigned incrementally. BEGIN does not immediately allocate a transaction ID; the first non-query statement after BEGIN allocates one. When a transaction inserts a tuple, the transaction\u0026rsquo;s txid is written into the tuple\u0026rsquo;s xmin. 
Transaction ID Comparison # PostgreSQL compares the age of transactions by their transaction IDs. src/backend/access/transam/transam.c defines four comparison functions: \u0026lt;, \u0026lt;=, \u0026gt;, \u0026gt;=:\nbool TransactionIdPrecedes() bool TransactionIdPrecedesOrEquals() bool TransactionIdFollows() bool TransactionIdFollowsOrEquals() They are similar. Let\u0026rsquo;s examine TransactionIdPrecedes() as the representative:\nbool TransactionIdPrecedes(TransactionId id1, TransactionId id2) { /* * If either ID is a permanent XID then we can just do unsigned * comparison. If both are normal, do a modulo-2^32 comparison. */ int32 diff; if (!TransactionIdIsNormal(id1) || !TransactionIdIsNormal(id2)) return (id1 \u0026lt; id2); diff = (int32) (id1 - id2); return (diff \u0026lt; 0); } Key points from this source code:\nTransactionIdIsNormal() is a macro defined in the header to check for normal transactions. FirstNormalTransactionId is the constant 3. So a normal transaction ID is \u0026gt;= 3. #define TransactionIdIsNormal(xid) ((xid) \u0026gt;= FirstNormalTransactionId) int32 is a signed integer: the first bit being 0 means positive, 1 means negative. Range: -2^31 ~ 2^31 - 1. Integer overflow: when a value exceeds the storage range (e.g., 2^31 barely overflows for int32), the value wraps around. The transaction ID comparison code can be understood in two parts:\nNon-normal transaction ID comparison:\nif (!TransactionIdIsNormal(id1) || !TransactionIdIsNormal(id2)) return (id1 \u0026lt; id2); When id1=2, id2=100: return(2\u0026lt;100), precedes is true — the normal transaction is newer.\nWhen id1=100, id2=2: return(100\u0026lt;2), precedes is false — the normal transaction is newer.\nSo, txid 1 and 2 are older than normal transactions.\nNormal transaction ID comparison:\ndiff = (int32) (id1 - id2); return (diff \u0026lt; 0); id1 - id2 can be negative, so diff cannot be unsigned int. It must be cast to signed int. 
Now the crucial part:\nSince int32 ranges from -2^31 to 2^31 - 1:\nWhen id1 = 2^31 + 99, id2 = 100: id1 - id2 = 2^31 - 1. Fine — int32 can hold this. → Larger txid is newer.\nWhen id1 = 2^31 + 100, id2 = 100: id1 - id2 = 2^31. Problem — exactly exceeds int32 storage. The value becomes 2^31 - 2^32 = -2^31 \u0026lt; 0. → Smaller txid is considered newer.\nWhen id1 = 100, id2 = 2^31 + 100: id1 - id2 = -2^31. Fine — int32 can hold this. → Larger txid is newer.\nWhen id1 = 100, id2 = 2^31 + 101: id1 - id2 = -2^31 - 1. Problem — exactly exceeds int32 storage. The value becomes -2^31 - 1 + 2^32 = 2^31 - 1 \u0026gt; 0. → Smaller txid is considered newer.\nFrom this analysis, when integer overflow occurs, a transaction with a larger txid cannot see a transaction with a smaller txid. The overflow itself is an exceptional event, so this is acceptable. To address this, PostgreSQL divides the 4-billion transaction ID space into two halves: one half is visible, the other invisible.\nFor example, for transaction txid 100, the 2 billion transactions in its past are visible, and the 2 billion transactions in its future are invisible. Therefore, the maximum difference between the oldest and newest transaction IDs (the database age) in PostgreSQL is |-2^31| = 2^31, roughly 2 billion.\nTransaction ID Wraparound # What is transaction ID wraparound?\nUnderstanding transaction ID wraparound itself is not difficult, but when I first studied it, I found two different definitions:\nPostgreSQL official definition:\nBecause transaction IDs are limited in size (32 bits), a cluster that runs for a long time (more than 4 billion transactions) will suffer transaction ID wraparound: the XID counter wraps around to zero, and suddenly past transactions appear to be in the future — meaning they become invisible. In short, catastrophic data loss. 
(The data is still there, but you can\u0026rsquo;t access it.)\ninterdb explanation:\nA tuple\u0026rsquo;s t_xmin records the ID of the transaction that inserted the tuple. If the tuple never changes, this t_xmin stays the same. Suppose tuple_1 was created by transaction txid=100, so its t_xmin=100. If the database advances by 2^31 transactions, reaching 2^31+100, tuple_1 is still visible. Then another transaction starts, advancing txid to 2^31+101. Now txid=100 is in the \u0026ldquo;future,\u0026rdquo; so tuple_1 becomes invisible. This is severe data loss, and it is what this explanation calls transaction ID wraparound.\nYes, the official documentation and some classic articles define transaction ID wraparound differently. They are indeed describing two different things. I attribute this to a translation issue: both behaviors are wraparound in English semantics. If you reconsider the meaning of \u0026ldquo;wraparound,\u0026rdquo; they are both forms of it.\nHowever, they differ: one is when transaction IDs (2^32) are fully exhausted and wrap back to 0; the other is when the \u0026ldquo;oldest transaction ID\u0026rdquo; and \u0026ldquo;newest transaction ID\u0026rdquo; differ by more than 2^31.\nThe official definition of transaction ID wraparound introduces the concept that \u0026ldquo;transaction IDs form a circle.\u0026rdquo; The generally understood transaction ID wraparound problem is the \u0026ldquo;circle divided into two halves, one visible, one invisible\u0026rdquo; concept: when the \u0026ldquo;more than half\u0026rdquo; threshold is crossed, that\u0026rsquo;s wraparound. 
In practice, the wraparound problem you actually need to worry about is the latter: the difference between the newest and oldest transaction IDs must not exceed 2.1 billion (2^31).\nHow long does 2.1 billion transactions take?\n2.1 billion transactions sounds like a lot, but it can still be exhausted.\nFor example, a PostgreSQL database with 100 TPS (not counting SELECT statements, since simple SELECTs don\u0026rsquo;t allocate transaction IDs) uses 8,640,000 transactions per day. It takes only about 2,147,483,648 / 8,640,000 ≈ 248 days to exhaust 2.1 billion transaction IDs and trigger wraparound. At 1,000 transactions per second, it takes less than one month. So transaction ID wraparound is something you must pay attention to in PostgreSQL.\nTransaction ID Freezing # To solve the serious data loss problem caused by transaction ID wraparound, PostgreSQL introduced the concept of transaction freezing.\nXIDs are reused cyclically and divided into two halves: one visible, one invisible. For a tuple with xid=100, if no operations are performed and transaction IDs keep advancing, the once-visible tuple will eventually become invisible.\nAs mentioned earlier, there is a frozen transaction ID. If the tuple with xid=100 is marked with the frozen transaction ID, it will remain visible. This is the purpose of transaction freezing.\nThe frozen transaction ID FrozenTransactionId = 2, and it is older than all normal transactions. That means txid=2 is visible to all normal transactions (txid \u0026gt;= 3). When t_xmin is older than current_txid - vacuum_freeze_min_age (default 50 million), the tuple is rewritten with the frozen transaction ID 2. In version 9.4 and later, the xmin_frozen flag in t_infomask is used to indicate a frozen tuple, rather than rewriting t_xmin to 2.\nThere are many optimization approaches to the transaction ID wraparound problem, but none can avoid transaction freezing. 
Freezing involves reading every row of every table and resetting flags — a massive I/O and CPU operation. There\u0026rsquo;s no escaping it; the database may even reject all operations until freezing completes. This is known as the \u0026ldquo;freeze bomb.\u0026rdquo; The busier the system and the higher the transaction rate, the more likely it is to trigger. (We\u0026rsquo;ll expand on freeze optimization in a future chapter.)\n64-bit Transaction IDs # The ultimate solution to transaction ID exhaustion and wraparound is using 64-bit transaction IDs. A 32-bit txid provides 2^32 IDs; a 64-bit txid provides 2^64. Even at 10,000 transactions per second — 864 million per day — it would take 58.49 million years to exhaust them. With 64-bit transaction IDs, they are practically inexhaustible. No wraparound, no freezing, no \u0026ldquo;freeze bomb\u0026rdquo;\u0026hellip;\nWhy hasn\u0026rsquo;t 64-bit transaction ID been implemented yet?\nNote: 64-bit transaction IDs already exist in PostgreSQL (as FullTransactionId described earlier). However, because tuple storage is limited, the xmin, xmax, etc. in tuples still use 32-bit XIDs, and transaction ID comparison still relies on 32-bit XIDs. xmin and xmax — the transaction IDs for insert and delete — are stored in each tuple\u0026rsquo;s header (we\u0026rsquo;ll cover tuple structure later), and header space is limited. A 32-bit txid is 4 bytes; a 64-bit txid is 8 bytes. Storing both xmin and xmax as 64-bit would require an extra 8 bytes, which the current header cannot accommodate. The community has discussed two approaches:\nExtend the header to store 64-bit transaction IDs directly. Keep the header size unchanged. Retain 64-bit transaction IDs in memory, adding an epoch concept to convert between the two. 
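The epoch idea in the second approach can be sketched in a few lines of C. This is an illustration under stated assumptions, not PostgreSQL's implementation: the type names `xid32` and `xid64` and the helpers `make_full_xid`, `epoch_of`, `low_xid_of`, and `widen_relative_to` are hypothetical. The sketch treats a 64-bit ID as (epoch, 32-bit XID) and widens a stored 32-bit XID by assuming it lies within 2^31 of a known 64-bit reference ID.

```c
#include <assert.h>   /* used by the checks in the usage note */
#include <stdint.h>

typedef uint32_t xid32;   /* hypothetical: 32-bit XID as stored in a tuple */
typedef uint64_t xid64;   /* hypothetical: 64-bit "full" XID kept in memory */

/* Combine an epoch (number of completed 2^32 cycles) with a 32-bit XID. */
static xid64 make_full_xid(uint32_t epoch, xid32 xid)
{
    return ((xid64) epoch << 32) | xid;
}

static uint32_t epoch_of(xid64 full)
{
    return (uint32_t) (full >> 32);
}

static xid32 low_xid_of(xid64 full)
{
    return (xid32) full;
}

/*
 * Widen a 32-bit XID to 64 bits, assuming it is within 2^31 of ref.
 * The signed 32-bit difference says whether xid is before or after ref;
 * adding it to ref recovers the epoch implicitly.
 */
static xid64 widen_relative_to(xid64 ref, xid32 xid)
{
    int32_t delta = (int32_t) (xid - low_xid_of(ref));
    return ref + (xid64) (int64_t) delta;
}
```

With a reference of epoch 1, XID 100, the stored value 4294967295 widens to epoch 0 (it was assigned just before the counter wrapped), while the stored value 200 widens to epoch 1. The sketch only works while every stored XID stays within 2^31 of the reference, which is exactly the constraint that freezing enforces today.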
The first approach has been essentially abandoned — compared to other systems, PostgreSQL\u0026rsquo;s tuple header is already large enough.\nThe second approach already has epochs and FullTransactionId-to-TransactionId conversion. The key is how to convert the TransactionId in tuples to FullTransactionId (though some extra storage for the epoch would still be needed — otherwise, how to implement it?).\nSee community mailing list discussions:\nhttps://www.postgresql.org/message-id/CAEYLb_UfC+HZ4RAP7XuoFZr+2_ktQmS9xqcQgE-rNf5UCqEt5A@mail.gmail.com\nhttps://www.postgresql.org/message-id/flat/DA1E65A4-7C5A-461D-B211-2AD5F9A6F2FD%40gmail.com\nThe community proposed 64-bit transaction IDs as a permanent solution to the freeze problem back in 2014, and began discussing practical implementation in 2017. But after several PostgreSQL versions, it\u0026rsquo;s still vaporware. Given the sensitivity and importance of data in databases, and how many things transaction ID changes touch — one slip could mean data loss or unknown bugs — PostgreSQL is moving cautiously. However, the community is still considering it. Hopefully one day, in some PostgreSQL version, the transaction ID wraparound problem will be completely solved.\nTransaction ID References # The Internals of PostgreSQL\nhttps://www.interdb.jp/pg/pgsql05.html\nhttps://www.interdb.jp/pg/pgsql06.html\nhttps://www.slideshare.net/masahikosawada98/introduction-vauum-freezing-xid-wraparound?from_action=save\nhttps://www.modb.pro/db/427012\nhttps://www.modb.pro/db/377530\nhttps://www.postgresql.org/docs/13/routine-vacuuming.html\nhttps://blog.csdn.net/weixin_30916255/article/details/112365965\nhttps://wiki.postgresql.org/wiki/FullTransactionId\nhttps://www.bookstack.cn/read/aliyun-rds-core/bd7e1c1955b35f7d.md\nhttps://github.com/digoal/blog/blob/master/201605/20160520_01.md\nTransaction-Related Tuple Structure # The tuple structure contains much of the information essential to PostgreSQL\u0026rsquo;s MVCC. 
The following sections cover xmin, xmax, t_ctid, cmin, cmax, combo CID, and tuple ID, along with their meanings and relationships.\nPhysical Structure # HeapTupleHeaderData is the tuple header. Its structure is defined in src/include/access/htup_details.h:\ntypedef struct HeapTupleFields { TransactionId t_xmin;\t/* transaction ID of inserter */ TransactionId t_xmax;\t/* transaction ID of deleter or locker */ union { CommandId\tt_cid;\t/* command ID of insert or delete */ TransactionId t_xvac;\t/* VACUUM FULL transaction ID */ }\tt_field3; } HeapTupleFields; typedef struct DatumTupleFields { ... } DatumTupleFields; struct HeapTupleHeaderData { union { HeapTupleFields t_heap; DatumTupleFields t_datum; }\tt_choice; ItemPointerData t_ctid;\t/* TID of current tuple or updated tuple */ ... }; Five definitions in HeapTupleHeaderData are critically important to MVCC. (Here, \u0026ldquo;x\u0026rdquo; = transaction, \u0026ldquo;c\u0026rdquo; = command, \u0026ldquo;t\u0026rdquo; = tuple, which helps with categorization.)\nt_xmin: the transaction ID that inserted this tuple. t_xmax: the ID of the transaction that deleted, updated, or locked this tuple. If the tuple has not been deleted or updated, xmax is 0. If the deleting or updating transaction rolled back, xmax still holds that aborted transaction\u0026rsquo;s ID. t_xvac: the transaction ID of a VACUUM FULL that moved the tuple. It shares storage with t_cid in the t_field3 union. t_cid: the command ID (cid). A transaction can contain multiple SQL statements. Commands within a transaction are numbered starting from 0, incrementing sequentially. CommandId is a uint32 type, supporting up to 2^32 - 1 commands. To conserve resources, and because queries don\u0026rsquo;t affect row transaction ordering, queries do not increment cid (similar to how transaction IDs are allocated). t_ctid: stores a pointer to itself or to a newer tuple. TID identifies a tuple within a table; it is the tuple\u0026rsquo;s physical address. 
If a record is modified multiple times, multiple versions exist. These versions are linked via t_ctid, forming a version chain that can be followed to find the latest version. System Columns # Every tuple has 6 system columns (directly queryable): tableoid, xmin, xmax, cmin, cmax, ctid. tableoid is the table\u0026rsquo;s OID and doesn\u0026rsquo;t change during queries or DML. Here we focus on the remaining 5:\nlzldb=# select xmin,xmax,cmin,cmax,ctid from lzl1; xmin | xmax | cmin | cmax | ctid ------+------+------+------+------- 616 | 619 | 0 | 0 | (0,3) cmin: the command ID that inserted the tuple. cmax: the command ID that deleted the tuple. xmin, xmax, and xvac are physically stored in struct HeapTupleFields. But cmin and cmax are not separate fields — they are derived from t_cid in the struct.\nThe source for cmin and cmax is in src/include/access/htup_details.h:\n/* SetCmin is reasonably simple since we never need a combo CID */ #define HeapTupleHeaderSetCmin(tup, cid) \\ do { \\ Assert(!((tup)-\u0026gt;t_infomask \u0026amp; HEAP_MOVED)); \\ (tup)-\u0026gt;t_choice.t_heap.t_field3.t_cid = (cid); \\ (tup)-\u0026gt;t_infomask \u0026amp;= ~HEAP_COMBOCID; \\ } while (0) /* SetCmax must be used after HeapTupleHeaderAdjustCmax; see combocid.c */ #define HeapTupleHeaderSetCmax(tup, cid, iscombo) \\ do { \\ Assert(!((tup)-\u0026gt;t_infomask \u0026amp; HEAP_MOVED)); \\ (tup)-\u0026gt;t_choice.t_heap.t_field3.t_cid = (cid); \\ if (iscombo) \\ (tup)-\u0026gt;t_infomask |= HEAP_COMBOCID; \\ else \\ (tup)-\u0026gt;t_infomask \u0026amp;= ~HEAP_COMBOCID; \\ } while (0) /* * HeapTupleHeaderGetRawCommandId will give you what\u0026#39;s in the header whether * it is useful or not. Most code should use HeapTupleHeaderGetCmin or * HeapTupleHeaderGetCmax instead, but note that those Assert that you can * get a legitimate result, ie you are in the originating transaction! 
*/ #define HeapTupleHeaderGetRawCommandId(tup) \\ ( \\ (tup)-\u0026gt;t_choice.t_heap.t_field3.t_cid \\ ) Combo CID # Before 8.3, cmin and cmax were separate. Later, considering that it\u0026rsquo;s rare for a single transaction to both insert and delete the same row, and that cmin/cmax are not needed after the transaction ends, the two were merged into a \u0026ldquo;combo command ID,\u0026rdquo; or combocid, to save header space.\ncombocid source: src/backend/utils/time/combocid.c\n/* Key and entry structures for the hash table */ typedef struct { CommandId\tcmin; CommandId\tcmax; } ComboCidKeyData; /* comboid structure is cmin and cmax */ static CommandId GetComboCommandId(CommandId cmin, CommandId cmax) { ... /* * The hash table is only created the first time a combo cid is used */ if (comboHash == NULL) { HASHCTL\thash_ctl; /* generate array and hash table */ comboCids = (ComboCidKeyData *) MemoryContextAlloc(TopTransactionContext, sizeof(ComboCidKeyData) * CCID_ARRAY_SIZE); sizeComboCids = CCID_ARRAY_SIZE; usedComboCids = 0; memset(\u0026amp;hash_ctl, 0, sizeof(hash_ctl)); ... comboHash = hash_create(\u0026#34;Combo CIDs\u0026#34;, CCID_HASH_SIZE, \u0026amp;hash_ctl, HASH_ELEM | HASH_BLOBS | HASH_CONTEXT); } ... } combocid is stored in a hash table. The first time a transaction uses combocid, a small block of memory is allocated to store it.\nSo the relationship among these command IDs is: combocid → (cmin, cmax) → (t_cid, t_cid).\nSimple Relationships Among Transaction IDs and System Columns # With all these IDs and source code, things might seem confusing. 
Here\u0026rsquo;s a diagram to help understand and remember the relationships among transaction IDs, command IDs, and tuple IDs:\nA First Taste of Transactions # Without any tools or extensions, let\u0026rsquo;s get a feel for how these system columns change during a transaction:\nlzldb=# select xmin,xmax,cmin,cmax,ctid from lzl1; xmin | xmax | cmin | cmax | ctid ------+------+------+------+------- 622 | 0 | 0 | 0 | (0,1) lzldb=# begin ; BEGIN lzldb=*# update lzl1 set a=2; UPDATE 1 -- after update, xmin+1, ctid+1; a new tuple appears lzldb=*# select xmin,xmax,cmin,cmax,ctid from lzl1; xmin | xmax | cmin | cmax | ctid ------+------+------+------+------- 623 | 0 | 0 | 0 | (0,2) lzldb=*# rollback; ROLLBACK -- xmax records the ID of the rolled-back transaction -- xmin and ctid return to old values; the old tuple barely changes lzldb=# select xmin,xmax,cmin,cmax,ctid from lzl1; xmin | xmax | cmin | cmax | ctid ------+------+------+------+------- 622 | 623 | 0 | 0 | (0,1) lzldb=# update lzl1 set a=2; UPDATE 1 -- update again; tuple number jumps over 2 directly to 3 lzldb=# select xmin,xmax,cmin,cmax,ctid from lzl1; xmin | xmax | cmin | cmax | ctid ------+------+------+------+------- 624 | 0 | 0 | 0 | (0,3) Tuple Header and Transactions # The pageinspect Extension # Simply looking at row changes won\u0026rsquo;t show old tuples. You need the pageinspect extension. pageinspect is a contrib module bundled with PostgreSQL that can display the detailed contents of data pages. To observe how tuples support transactions, we\u0026rsquo;ll use get_raw_page() and heap_page_items().\nget_raw_page(): returns the binary content of a specified block. The fork parameter accepts main, fsm, vm, or init. main is the main data file; fsm is the free space map; vm is the visibility map; init is the initialization fork. 
Defaults to main if not specified.\nheap_page_items(): displays all line pointers on a heap page, including rows invisible under MVCC.\nGenerally, get_raw_page() is passed as a parameter to heap_page_items() to display tuple headers, pointers, and the data itself.\nheap_tuple_infomask_flags: converts decimal infomask/infomask2 values into their meanings (flags), outputting two columns: all individual flags and combined flags. (Infomask is covered later.)\nlzldb=# create extension pageinspect; CREATE EXTENSION lzldb=# select t_xmin,t_xmax,t_field3 as t_cid,t_ctid from heap_page_items(get_raw_page(\u0026#39;lzl1\u0026#39;,0)); t_xmin | t_xmax | t_cid | t_ctid --------+--------+-------+-------- 633 | 0 | 0 | (0,1) lp (Line Pointer) # A line pointer is essentially a row pointer number within a page, marking a tuple\u0026rsquo;s location. t_ctid looks more like a tuple ID, but ctid is simply the combination of (table page number, line pointer number). ctid can point to the next lp.\nFor example, after one UPDATE, a new tuple is added. The new tuple\u0026rsquo;s lp number increments by 1, the old tuple\u0026rsquo;s ctid points to the new tuple\u0026rsquo;s lp, and the new tuple\u0026rsquo;s ctid points to itself:\nlzldb=# select lp,t_ctid from heap_page_items(get_raw_page(\u0026#39;lzl1\u0026#39;,0)); lp | t_ctid ----+-------- 1 | (0,1) (1 row) lzldb=# update lzl1 set a=2; UPDATE 1 lzldb=# select lp,t_ctid from heap_page_items(get_raw_page(\u0026#39;lzl1\u0026#39;,0)); lp | t_ctid ----+-------- 1 | (0,2) 2 | (0,2) lp source: src/include/storage/itemid.h. The ItemIdData struct stores the tuple\u0026rsquo;s offset, state, and length:\ntypedef struct ItemIdData { unsigned\tlp_off:15,\t/* tuple offset within the page */ lp_flags:2,\t/* lp state */ lp_len:15;\t/* tuple length */ } ItemIdData; typedef ItemIdData *ItemId; /* lp_off:15 is a bit-field; lp_off occupies 15 bits of the unsigned. The 3 fields together total 32 bits. So ItemIdData fits in one unsigned int: 4 bytes, 32 bits. 
*/ lp_flags defines 4 states:\n/* *lp_flags has these possible states. An UNUSED line pointer is available *for immediate re-use, the other states are not. */ #define LP_UNUSED\t0\t/* lp not in use, tuple length lp_len always 0 */ #define LP_NORMAL\t1\t/* lp in use, tuple length lp_len always \u0026gt; 0 */ #define LP_REDIRECT\t2\t/* HOT redirect to another lp (should have lp_len=0) */ #define LP_DEAD\t3\t/* dead lp, vacuumable */ lzldb=# select lp,lp_flags,lp_off,lp_len from heap_page_items(get_raw_page(\u0026#39;lzl1\u0026#39;,0)); lp | lp_flags | lp_off | lp_len ----+----------+--------+-------- 1 | 1 | 8160 | 28 Infomask # Infomask provides information about transactions, locks, tuple state, etc. — such as committed, aborted, lock, HOT info, and more. There are two infomask fields in the header: infomask and infomask2. They store different information.\ninfomask and infomask2 # infomask source is in src/include/access/htup_details.h:\n#define FIELDNO_HEAPTUPLEHEADERDATA_INFOMASK2 2 uint16\tt_infomask2;\t/* number of attributes + various flags */ #define FIELDNO_HEAPTUPLEHEADERDATA_INFOMASK 3 uint16\tt_infomask;\t/* various flag bits, see below */ infomask Flag Meanings # /* * information stored in t_infomask: */ #define HEAP_HASNULL\t0x0001\t/* tuple has null values */ #define HEAP_HASVARWIDTH\t0x0002\t/* tuple has variable-width attributes, e.g. 
varchar */ #define HEAP_HASEXTERNAL\t0x0004\t/* tuple has TOAST storage */ #define HEAP_HASOID_OLD\t0x0008\t/* tuple has OID */ #define HEAP_XMAX_KEYSHR_LOCK\t0x0010\t/* tuple has FOR KEY SHARE lock */ #define HEAP_COMBOCID\t0x0020\t/* t_cid is a combo CID */ #define HEAP_XMAX_EXCL_LOCK\t0x0040\t/* tuple has FOR UPDATE lock */ #define HEAP_XMAX_LOCK_ONLY\t0x0080\t/* xmax is only a locker */ /* xmax is a shared locker */ #define HEAP_XMAX_SHR_LOCK\t(HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_KEYSHR_LOCK) #define HEAP_LOCK_MASK\t(HEAP_XMAX_SHR_LOCK | HEAP_XMAX_EXCL_LOCK | \\ HEAP_XMAX_KEYSHR_LOCK) #define HEAP_XMIN_COMMITTED\t0x0100\t/* inserting transaction committed */ #define HEAP_XMIN_INVALID\t0x0200\t/* inserting transaction invalid or aborted */ #define HEAP_XMIN_FROZEN\t(HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID) #define HEAP_XMAX_COMMITTED\t0x0400\t/* deleting transaction committed */ #define HEAP_XMAX_INVALID\t0x0800\t/* deleting transaction invalid or aborted */ #define HEAP_XMAX_IS_MULTI\t0x1000\t/* t_xmax is a MultiXactId */ #define HEAP_UPDATED\t0x2000\t/* this is an updated version of a row */ #define HEAP_MOVED_OFF\t0x4000\t/* moved elsewhere by pre-9.0 VACUUM FULL; kept for binary upgrade compatibility */ #define HEAP_MOVED_IN\t0x8000\t/* moved from elsewhere, opposite of HEAP_MOVED_OFF; kept for compatibility */ #define HEAP_MOVED (HEAP_MOVED_OFF | HEAP_MOVED_IN) #define HEAP_XACT_MASK\t0xFFF0\t/* visibility-related bits */ infomask2 Flag Meanings # #define HEAP_NATTS_MASK\t0x07FF\t/* 11 bits for the number of columns (MaxHeapAttributeNumber is 1600) */ /* bits 0x1800 are available */ #define HEAP_KEYS_UPDATED\t0x2000\t/* tuple updated or deleted */ #define HEAP_HOT_UPDATED\t0x4000\t/* tuple updated, new tuple is HOT */ #define HEAP_ONLY_TUPLE\t0x8000\t/* HOT tuple */ #define HEAP2_XACT_MASK\t0xE000\t/* visibility-related bits */ #define HEAP_TUPLE_HAS_MATCH\tHEAP_ONLY_TUPLE /* flag temporarily used in Hash Join, only for Hash table tuples that don\u0026#39;t need 
visibility info; we can reuse a visibility flag instead of a separate bit */ infomask Bit Calculation # Converting hex to binary makes it easier to understand the bit meanings:\n-- convert hex 1600 to binary lzldb=# select x\u0026#39;1600\u0026#39;::bit(16); bit ------------------ 0001011000000000 infomask:\n0000000000000001 0x0001 HEAP_HASNULL\t0000000000000010 0x0002 HEAP_HASVARWIDTH\t0000000000000100 0x0004 HEAP_HASEXTERNAL\t0000000000001000 0x0008 HEAP_HASOID_OLD\t0000000000010000 0x0010 HEAP_XMAX_KEYSHR_LOCK\t0000000000100000 0x0020 HEAP_COMBOCID 0000000001000000 0x0040 HEAP_XMAX_EXCL_LOCK 0000000010000000 0x0080 HEAP_XMAX_LOCK_ONLY\t0000000001010000 0x0050 HEAP_XMAX_SHR_LOCK bitwise OR: (HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_KEYSHR_LOCK)=10|40=50 0000000001010000 0x0050 HEAP_LOCK_MASK bitwise OR: (HEAP_XMAX_SHR_LOCK | HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_KEYSHR_LOCK)=50|40|10=50 0000000100000000 0x0100 HEAP_XMIN_COMMITTED\t0000001000000000 0x0200 HEAP_XMIN_INVALID\t0000001100000000 0x0300 HEAP_XMIN_FROZEN bitwise OR: (HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID)=100|200=300 0000010000000000 0x0400 HEAP_XMAX_COMMITTED\t0000100000000000 0x0800 HEAP_XMAX_INVALID\t0001000000000000 0x1000 HEAP_XMAX_IS_MULTI\t0010000000000000 0x2000 HEAP_UPDATED\t0100000000000000 0x4000 HEAP_MOVED_OFF\t1000000000000000 0x8000 HEAP_MOVED_IN\t1100000000000000 0xC000 HEAP_MOVED bitwise OR: (HEAP_MOVED_OFF | HEAP_MOVED_IN)=4000|8000=C000 1111111111110000 0xFFF0 HEAP_XACT_MASK infomask2:\n0000011111111111 0x07FF HEAP_NATTS_MASK PostgreSQL max columns is 1600 = 0000011001000000, so 11 bits suffice for column count 0001100000000000 0x1800 available bits, apparently unused 0010000000000000 0x2000 HEAP_KEYS_UPDATED 0100000000000000 0x4000 HEAP_HOT_UPDATED 1000000000000000 0x8000 HEAP_ONLY_TUPLE 1110000000000000 0xE000 HEAP2_XACT_MASK How to Compute Infomask? # Infomask flags are hexadecimal. pageinspect returns them as decimal. 
Use to_hex() to convert from decimal to hexadecimal:\nlzldb=# select lp,t_ctid,to_hex(t_infomask) infomask,to_hex(t_infomask2) infomask2 from heap_page_items(get_raw_page(\u0026#39;lzl1\u0026#39;,0)); lp | t_ctid | infomask | infomask2 ----+--------+----------+----------- 1 | (0,1) | 2b00 | 1 infomask=2b00 is still a bit opaque. Convert to binary and match against the flag meanings: 0010101100000000 = HEAP_UPDATED + HEAP_XMAX_INVALID + HEAP_XMIN_FROZEN.\nMeaning: this tuple is an updated version of a row, xmax is invalid (0), and xmin is frozen (visible to all transactions).\ninfomask2=1: the low 11 bits (mask 0x07FF, values up to 2047, enough for the 1600-column limit) hold the number of user columns. So 1 means the tuple has only 1 column.\nManually computing infomask is tedious. Starting from pg13, pageinspect provides the heap_tuple_infomask_flags function to decode infomask and infomask2. Individual bits are shown as raw_flags; combined multi-bit flags are shown as combined_flags:\nlzldb=# SELECT t_ctid, raw_flags, combined_flags FROM heap_page_items(get_raw_page(\u0026#39;lzl1\u0026#39;, 0)), LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2) WHERE t_infomask IS NOT NULL OR t_infomask2 IS NOT NULL; t_ctid | raw_flags | combined_flags --------+------------------------------------------------------------------------+-------------------- (0,1) | {HEAP_XMIN_COMMITTED,HEAP_XMIN_INVALID,HEAP_XMAX_INVALID,HEAP_UPDATED} | {HEAP_XMIN_FROZEN} Commit Log (CLOG) # PostgreSQL uses the commit log (CLOG) to store transaction status. PostgreSQL writes a transaction\u0026rsquo;s changes to WAL before the transaction completes; that\u0026rsquo;s what write-ahead logging means. 
If a transaction aborts, its status is written to both WAL and CLOG so that during instance recovery, PostgreSQL knows the transaction was not committed.\nWhen transaction status is needed — for example, when determining visibility — PostgreSQL reads the CLOG.\nTransaction status\nSource: src/include/access/clog.h\n#define TRANSACTION_STATUS_IN_PROGRESS\t0x00 #define TRANSACTION_STATUS_COMMITTED\t0x01 #define TRANSACTION_STATUS_ABORTED\t0x02 #define TRANSACTION_STATUS_SUB_COMMITTED\t0x03 The CLOG defines four transaction states: IN_PROGRESS, COMMITTED, ABORTED, SUB_COMMITTED.\nTransaction status size\nSource: src/backend/access/transam/clog.c\n/* We need two bits per xact, so four xacts fit in a byte */ #define CLOG_BITS_PER_XACT\t2 #define CLOG_XACTS_PER_BYTE 4 #define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE) #define CLOG_XACT_BITMASK\t((1 \u0026lt;\u0026lt; CLOG_BITS_PER_XACT) - 1) Transaction status is very small — only 2 bits per transaction. One byte can store 4 transaction states. A standard page can hold 8K * 4 = 32,768 transaction states.\nCLOG persistence\nWhen PostgreSQL shuts down or checkpoints, CLOG data is written to the pg_clog directory. In version 10.0 and later, pg_clog was renamed to pg_xact.\n[pg@lzl pg_xact]$ ll total 8 -rw------- 1 pg pg 8192 Mar 28 23:33 0000 On disk, CLOG files are named 0000, 0001, etc. CLOG files are 256KB in size, while in-memory pages storing transaction states are 8KB. So the 0000 file\u0026rsquo;s size will always be a multiple of 8192. After 32 CLOG pages are written, the next page goes into the 0001 file. PostgreSQL reads transaction states from pg_xact into memory at startup.\nDuring system operation, not all transaction states need to be retained in CLOG files forever, so VACUUM periodically deletes no-longer-needed CLOG files.\nHint Bits # What Are Hint Bits? # Hint bits mark whether the transaction that created or deleted a row has committed or aborted. 
Without hint bits, determining transaction visibility requires accessing on-disk pg_clog (pg_xact since version 10) or pg_subtrans, which is a relatively expensive operation. If a tuple has hint bits set, you can determine the tuple\u0026rsquo;s state just by reading the page, with no extra access needed.\nThe source code uses SetHintBits() to set hint bits:\nSetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED, InvalidTransactionId); SetHintBits sets one of four hint-bit flags in infomask: two bits for xmin and two for xmax (the two xmin bits together also form HEAP_XMIN_FROZEN; hint bits exist purely to mark transaction state):\n#define HEAP_XMIN_COMMITTED\t0x0100\t/* inserting or updating transaction committed */ #define HEAP_XMIN_INVALID\t0x0200\t/* inserting or updating transaction invalid or aborted */ #define HEAP_XMAX_COMMITTED\t0x0400\t/* deleting or updating transaction committed */ #define HEAP_XMAX_INVALID\t0x0800\t/* deleting or updating transaction invalid or aborted */ Queries Can Cause Writes # When a DML statement runs, PostgreSQL records the transaction ID (such as t_xmin) in the tuple header. But when the transaction ends, nothing is done to the header. Instead, a subsequent DML, DQL, or VACUUM that scans the relevant tuple triggers SetHintBits (this happens in HeapTupleSatisfiesMVCC() when a new snapshot accesses data; we\u0026rsquo;ll cover visibility rules later).\nBefore SetHintBits is triggered, PostgreSQL looks up transaction status in the CLOG. 
After SetHintBits is triggered, it reads the hint bits in the data page\u0026rsquo;s tuple header.\nFor example, an INSERT statement:\nlzldb=# insert into lzl1 values(1); INSERT 0 1 lzldb=# SELECT t_ctid, raw_flags, combined_flags lzldb-# FROM heap_page_items(get_raw_page(\u0026#39;lzl1\u0026#39;, 0)), lzldb-# LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2) lzldb-# WHERE t_infomask IS NOT NULL OR t_infomask2 IS NOT NULL; t_ctid | raw_flags | combined_flags --------+---------------------+---------------- (0,1) | {HEAP_XMAX_INVALID} | {} (1 row) lzldb=# select * from lzl1; -- just a single query a --- 1 (1 row) lzldb=# SELECT t_ctid, raw_flags, combined_flags FROM heap_page_items(get_raw_page(\u0026#39;lzl1\u0026#39;, 0)), LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2) WHERE t_infomask IS NOT NULL OR t_infomask2 IS NOT NULL; t_ctid | raw_flags | combined_flags --------+-----------------------------------------+---------------- (0,1) | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {} After one query, t_infomask changed — the tuple header changed.\nAfter INSERT, SetHintBits only had HEAP_XMAX_INVALID, because INSERT only updates xmin. Whether the transaction commits or aborts (exits or rolls back), xmax is unused and can be set to HEAP_XMAX_INVALID along with the transaction.\nBut the transaction may commit or abort (exit/rollback). Since transaction completion does not update the tuple, HEAP_XMIN_COMMITTED cannot be set upon completion. During visibility checking (heapam_visibility.c), the visibility check updates the transaction state by calling SetHintBits on t_infomask. Thus, the query updated HEAP_XMIN_COMMITTED.\nHint bits advantage: completing (or failing) data modifications in a transaction produces no writes to the tuple. 
Commit and rollback are very fast.\nHint bits disadvantage: if a transaction updates many rows, the next query performing visibility checks may need to read transaction states from pg_clog and update many pages.\nDo Hint Bits Generate WAL? # When checksums are enabled or wal_log_hints is true, if the first operation to make a page dirty after a checkpoint is updating hint bits, a WAL record is generated — specifically, a Full Page Image — to prevent partial writes that would cause checksum mismatches.\nTherefore, with checksums enabled or wal_log_hints set to true, even a SELECT can modify page hint bits, which may generate WAL — increasing WAL storage to some extent. If you observe SELECT triggering disk writes, check whether CHECKSUM or wal_log_hints is enabled.\nWhy Are Hint Bits Deferred? # In src/backend/access/heap/heapam_visibility.c, within the HeapTupleSatisfiesMVCC() visibility function, a comment explains why hint bits are deferred:\n/* *While insert/delete operations are still running, hint bits on tuples are not updated, *even if the transaction has committed or aborted. *In high-concurrency scenarios, sharing data structures can cause contention, *and this doesn\u0026#39;t affect visibility decisions anyway. *Hint bits are only set the first time a fresh snapshot accesses data after transaction completion. *So HeapTupleSatisfiesMVCC always runs TransactionIdIsCurrentTransactionId and XidInMVCCSnapshot *to determine whether the tuple belongs to the current transaction. *In older versions, PostgreSQL tried to update hint bits immediately (even while transactions were running), *but this caused more contention on the PGXACT array. */ Simply put: immediate hint bit updates perform very poorly. So transaction status is first stored in CLOG to reduce PGXACT contention and improve performance. 
Deferred hint bits are why later queries may update tuple headers.\nTuple DML Operations # Now that we\u0026rsquo;ve built up knowledge of tuple headers, system columns, CLOG, and hint bits, let\u0026rsquo;s see how PostgreSQL performs INSERT, UPDATE, and DELETE.\nObserving DML Transactions # We\u0026rsquo;ll observe PostgreSQL\u0026rsquo;s DML transaction behavior by examining tuple header fields: lp, lp_flags, ctid, xmin, xmax, cid (cmin, cmax), infomask, and infomask2.\nWe\u0026rsquo;ll use the following query:\nselect t_ctid,lp,case lp_flags when 0 then \u0026#39;0:LP_UNUSED\u0026#39; when 1 then \u0026#39;LP_NORMAL\u0026#39; when 2 then \u0026#39;LP_REDIRECT\u0026#39; when 3 then \u0026#39;LP_DEAD\u0026#39; end as lp_flags,t_xmin,t_xmax,t_field3 as t_cid, raw_flags, info.combined_flags from heap_page_items(get_raw_page(\u0026#39;lzl1\u0026#39;,0)) item,LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2) info order by lp; (A side note: some sources like to write SELECT '(0,'||lp||')' AS ctid. This is misleading — lp and ctid are different things. lp is like a row number; ctid points to a line pointer number. lp can be different from ctid.)\nFor readability, create a view:\ncreate view vlzl1 as select t_ctid,lp,case lp_flags when 0 then \u0026#39;0:LP_UNUSED\u0026#39; when 1 then \u0026#39;LP_NORMAL\u0026#39; when 2 then \u0026#39;LP_REDIRECT\u0026#39; when 3 then \u0026#39;LP_DEAD\u0026#39; end as lp_flags,t_xmin,t_xmax,t_field3 as t_cid, raw_flags, info.combined_flags from heap_page_items(get_raw_page(\u0026#39;lzl1\u0026#39;,0)) item,LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2) info order by lp; Now the query looks like:\nlzldb=# \\x Expanded display is on. 
lzldb=# select * from vlzl1; -[ RECORD 6 ]--+------- t_ctid | (0,6) lp | 6 lp_flags | LP_NORMAL t_xmin | 653 t_xmax | 0 t_cid | 0 raw_flags | {HEAP_XMAX_INVALID,HEAP_UPDATED,HEAP_ONLY_TUPLE} combined_flags | {} INSERT # Truncate the table, then insert a row:\nlzldb=# begin ; BEGIN lzldb=*# insert into lzl1 values(1); INSERT 0 1 lzldb=*# insert into lzl1 values(2); INSERT 0 1 lzldb=*# commit; lzldb=# select * from vlzl1; t_ctid | lp | lp_flags | t_xmin | t_xmax | t_cid | raw_flags | combined_flags --------+----+-----------+--------+--------+-------+---------------------+---------------- (0,1) | 1 | LP_NORMAL | 664 | 0 | 0 | {HEAP_XMAX_INVALID} | {} (0,2) | 2 | LP_NORMAL | 664 | 0 | 1 | {HEAP_XMAX_INVALID} | {} ctid points to (page 0, lp 1), i.e., to itself. lp (line pointer number) increments. Both tuples share the same xmin — they were inserted by the same transaction. xmax is 0 (invalid transaction ID). Infomask only indicates xmax is invalid: this tuple has not yet \u0026ldquo;experienced\u0026rdquo; a delete transaction. cid increments from 0: 0 for the first command, 1 for the second. DELETE # lzldb=# begin; BEGIN lzldb=*# delete from lzl1 where a=1; DELETE 1 lzldb=*# commit; COMMIT lzldb=# select * from vlzl1; t_ctid | lp | lp_flags | t_xmin | t_xmax | t_cid | raw_flags | combined_flags --------+----+-----------+--------+--------+-------+-----------------------------------------+---------------- (0,1) | 1 | LP_NORMAL | 664 | 665 | 0 | {HEAP_XMIN_COMMITTED,HEAP_KEYS_UPDATED} | {} (0,2) | 2 | LP_NORMAL | 664 | 0 | 1 | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {} The first tuple was deleted. The tuple wasn\u0026rsquo;t physically removed — only a few attributes were marked:\nctid unchanged, still points to itself. xmax updated to the delete transaction ID. Infomask shows HEAP_KEYS_UPDATED, indicating the tuple was deleted (actually, HEAP_KEYS_UPDATED means either deleted or updated). 
Although only the first tuple was modified, the second tuple\u0026rsquo;s infomask also gained HEAP_XMIN_COMMITTED, a hint bit set when the DELETE\u0026rsquo;s scan checked that tuple\u0026rsquo;s visibility. UPDATE # lzldb=# begin; BEGIN lzldb=*# update lzl1 set a=3; UPDATE 1 lzldb=*# commit; COMMIT lzldb=# select * from vlzl1; t_ctid | lp | lp_flags | t_xmin | t_xmax | t_cid | raw_flags | combined_flags --------+----+-----------+--------+--------+-------+-------------------------------------------------------------+---- (0,1) | 1 | LP_NORMAL | 664 | 665 | 0 | {HEAP_XMIN_COMMITTED,HEAP_XMAX_COMMITTED,HEAP_KEYS_UPDATED} | {} (0,3) | 2 | LP_NORMAL | 664 | 666 | 0 | {HEAP_XMIN_COMMITTED,HEAP_HOT_UPDATED} | {} (0,3) | 3 | LP_NORMAL | 666 | 0 | 0 | {HEAP_XMAX_INVALID,HEAP_UPDATED,HEAP_ONLY_TUPLE} | {} An UPDATE doesn\u0026rsquo;t modify the tuple in place. Instead, it marks the old tuple as unavailable and inserts a new one:\nlp=2 is the old tuple from the update transaction. t_xmax is the update transaction ID. Infomask adds HEAP_HOT_UPDATED, indicating this old version was HOT-updated (its successor is a heap-only tuple on the same page). ctid points to the new tuple. lp=3 is the new tuple from the update. It\u0026rsquo;s equivalent to an inserted tuple, but xmin matches the old tuple\u0026rsquo;s xmax. Infomask has the extra flags HEAP_UPDATED, indicating this is the updated version, and HEAP_ONLY_TUPLE, indicating it is the heap-only half of the HOT update. Additionally, the deleted (now invisible) tuple at lp=1 gained HEAP_XMAX_COMMITTED: the later, unrelated UPDATE checked its visibility and set the hint bit. 
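The mark-old, insert-new behavior observed above can be modeled in a few lines of C. This is a minimal sketch, not PostgreSQL code: ToyTuple, ToyPage, toy_insert and toy_hot_update are hypothetical names, the page is always page 0, and only the fields shown in the vlzl1 view are tracked. The flag values are the ones defined in src/include/access/htup_details.h.

```c
#include <assert.h>
#include <stdint.h>

/* Flag values as defined in src/include/access/htup_details.h */
#define HEAP_XMAX_INVALID 0x0800 /* t_infomask  */
#define HEAP_UPDATED      0x2000 /* t_infomask  */
#define HEAP_HOT_UPDATED  0x4000 /* t_infomask2 */
#define HEAP_ONLY_TUPLE   0x8000 /* t_infomask2 */

#define TOY_MAX_ITEMS 16

/* Hypothetical, heavily simplified tuple header (not HeapTupleHeaderData) */
typedef struct ToyTuple {
    uint32_t xmin;      /* inserting transaction */
    uint32_t xmax;      /* deleting or updating transaction, 0 if none */
    int      ctid_lp;   /* line pointer that ctid points at (page 0 only) */
    uint16_t infomask;
    uint16_t infomask2;
} ToyTuple;

/* Hypothetical page: items[] is 1-based so the index equals lp */
typedef struct ToyPage {
    int      nitems;
    ToyTuple items[TOY_MAX_ITEMS + 1];
} ToyPage;

/* INSERT: new tuple, ctid points at itself, xmax marked invalid */
int toy_insert(ToyPage *page, uint32_t xid)
{
    assert(page->nitems < TOY_MAX_ITEMS);
    int lp = ++page->nitems;
    ToyTuple *t = &page->items[lp];

    t->xmin = xid;
    t->xmax = 0;
    t->ctid_lp = lp;
    t->infomask = HEAP_XMAX_INVALID;
    t->infomask2 = 0;
    return lp;
}

/* HOT UPDATE: stamp the old version, append the new one on the same page */
int toy_hot_update(ToyPage *page, int old_lp, uint32_t xid)
{
    assert(old_lp >= 1 && old_lp <= page->nitems);
    int new_lp = toy_insert(page, xid);
    ToyTuple *old_tup = &page->items[old_lp];
    ToyTuple *new_tup = &page->items[new_lp];

    old_tup->xmax = xid;                 /* old version ends at the updater */
    old_tup->ctid_lp = new_lp;           /* ctid now points at the successor */
    old_tup->infomask &= (uint16_t) ~HEAP_XMAX_INVALID;
    old_tup->infomask2 |= HEAP_HOT_UPDATED;

    new_tup->infomask |= HEAP_UPDATED;   /* new version came from an UPDATE */
    new_tup->infomask2 |= HEAP_ONLY_TUPLE;
    return new_lp;
}
```

Calling toy_insert and then toy_hot_update on an empty ToyPage reproduces the pattern in the psql output: the old tuple keeps its xmin, gets the updater's xid as xmax and a ctid pointing at the new line pointer, while the new tuple starts with HEAP_XMAX_INVALID plus HEAP_UPDATED and HEAP_ONLY_TUPLE. Hint bits such as HEAP_XMIN_COMMITTED are deliberately left out, since they are set later by visibility checks.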
Rollback # lzldb=# truncate table lzl1; TRUNCATE TABLE lzldb=# begin; BEGIN lzldb=*# insert into lzl1 values(1); -- INSERT INSERT 0 1 lzldb=*# select * from vlzl1; t_ctid | lp | lp_flags | t_xmin | t_xmax | t_cid | raw_flags | combined_flags --------+----+-----------+--------+--------+-------+---------------------+---------------- (0,1) | 1 | LP_NORMAL | 679 | 0 | 0 | {HEAP_XMAX_INVALID} | {} (1 row) lzldb=*# rollback; -- INSERT rolled back ROLLBACK lzldb=# select * from vlzl1; t_ctid | lp | lp_flags | t_xmin | t_xmax | t_cid | raw_flags | combined_flags --------+----+-----------+--------+--------+-------+---------------------+---------------- (0,1) | 1 | LP_NORMAL | 679 | 0 | 0 | {HEAP_XMAX_INVALID} | {} lzldb=# select * from lzl1; a --- (0 rows) -- After INSERT and rollback, the tuple header shows no changes. lzldb=# insert into lzl1 values(2); INSERT 0 1 lzldb=# begin ; BEGIN lzldb=*# delete from lzl1 ; -- DELETE DELETE 1 lzldb=*# select * from vlzl1; t_ctid | lp | lp_flags | t_xmin | t_xmax | t_cid | raw_flags | combined_flags --------+----+-----------+--------+--------+-------+-----------------------------------------+---------------- (0,1) | 1 | LP_NORMAL | 684 | 0 | 0 | {HEAP_XMIN_INVALID,HEAP_XMAX_INVALID} | {} (0,2) | 2 | LP_NORMAL | 685 | 686 | 0 | {HEAP_XMIN_COMMITTED,HEAP_KEYS_UPDATED} | {} (2 rows) lzldb=*# rollback; -- DELETE rolled back ROLLBACK lzldb=# select * from vlzl1; t_ctid | lp | lp_flags | t_xmin | t_xmax | t_cid | raw_flags | combined_flags --------+----+-----------+--------+--------+-------+-----------------------------------------+---------------- (0,1) | 1 | LP_NORMAL | 684 | 0 | 0 | {HEAP_XMIN_INVALID,HEAP_XMAX_INVALID} | {} (0,2) | 2 | LP_NORMAL | 685 | 686 | 0 | {HEAP_XMIN_COMMITTED,HEAP_KEYS_UPDATED} | {} -- After DELETE and rollback, the tuple header shows no changes. 
lzldb=*# update lzl1 set a=100 ; -- UPDATE UPDATE 1 lzldb=*# select * from vlzl1; t_ctid | lp | lp_flags | t_xmin | t_xmax | t_cid | raw_flags | combined_flags --------+----+-----------+--------+--------+-------+--------------------------------------------------+--------------- (0,1) | 1 | LP_NORMAL | 684 | 0 | 0 | {HEAP_XMIN_INVALID,HEAP_XMAX_INVALID} | {} (0,3) | 2 | LP_NORMAL | 685 | 688 | 0 | {HEAP_XMIN_COMMITTED,HEAP_HOT_UPDATED} | {} (0,3) | 3 | LP_NORMAL | 688 | 0 | 0 | {HEAP_XMAX_INVALID,HEAP_UPDATED,HEAP_ONLY_TUPLE} | {} (3 rows) lzldb=*# rollback; -- UPDATE rolled back ROLLBACK lzldb=# select * from vlzl1; t_ctid | lp | lp_flags | t_xmin | t_xmax | t_cid | raw_flags | combined_flags --------+----+-----------+--------+--------+-------+--------------------------------------------------+--------------- (0,1) | 1 | LP_NORMAL | 684 | 0 | 0 | {HEAP_XMIN_INVALID,HEAP_XMAX_INVALID} | {} (0,3) | 2 | LP_NORMAL | 685 | 688 | 0 | {HEAP_XMIN_COMMITTED,HEAP_HOT_UPDATED} | {} (0,3) | 3 | LP_NORMAL | 688 | 0 | 0 | {HEAP_XMAX_INVALID,HEAP_UPDATED,HEAP_ONLY_TUPLE} | {} -- After UPDATE and rollback, the tuple header shows no changes. When a transaction rolls back, the tuple header does not change at all. This is why PostgreSQL\u0026rsquo;s MVCC doesn\u0026rsquo;t worry about running out of rollback segments: rollback only changes the transaction\u0026rsquo;s recorded status, and therefore visibility; it does not rewrite data. xmax doesn\u0026rsquo;t change after rollback either, so a non-zero xmax doesn\u0026rsquo;t necessarily mean the tuple was deleted; the delete or update transaction may have rolled back. However, once a later visibility check reads such tuples, hint bits are written even though no data changed: a tuple whose inserting transaction aborted gets HEAP_XMIN_INVALID (as lp=1 shows above), whether it is an ordinary tuple or a heap-only (HOT) tuple. 
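The point that rollback changes only the transaction's status, never the tuple, can be sketched as follows. This is a toy model under stated assumptions: ToyClog, ToyRow and the toy_* functions are hypothetical, xids are small array indexes, and row locks, multixacts, subtransactions and hint bits are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_XID 1024

/* Hypothetical stand-in for the per-transaction status kept in CLOG */
typedef enum ToyXidStatus {
    TOY_IN_PROGRESS = 0,
    TOY_COMMITTED,
    TOY_ABORTED
} ToyXidStatus;

typedef struct ToyClog {
    ToyXidStatus status[TOY_MAX_XID];
} ToyClog;

/* Hypothetical tuple header reduced to the two transaction ids */
typedef struct ToyRow {
    uint32_t xmin;
    uint32_t xmax; /* 0 means no deleter was ever recorded */
} ToyRow;

/* DELETE only stamps xmax on the tuple */
void toy_delete(ToyRow *row, uint32_t xid)
{
    assert(xid > 0 && xid < TOY_MAX_XID);
    row->xmax = xid;
}

/* COMMIT and ROLLBACK touch the status array only, never a tuple */
void toy_commit(ToyClog *clog, uint32_t xid)
{
    assert(xid < TOY_MAX_XID);
    clog->status[xid] = TOY_COMMITTED;
}

void toy_rollback(ToyClog *clog, uint32_t xid)
{
    assert(xid < TOY_MAX_XID);
    clog->status[xid] = TOY_ABORTED;
}

/* A row counts as deleted only if its xmax transaction committed */
bool toy_row_is_deleted(const ToyRow *row, const ToyClog *clog)
{
    if (row->xmax == 0)
        return false;
    assert(row->xmax < TOY_MAX_XID);
    return clog->status[row->xmax] == TOY_COMMITTED;
}
```

After toy_delete followed by toy_rollback, the row still carries the deleter's xid in xmax, exactly as in the transcript, yet toy_row_is_deleted returns false because the status lookup says the deleter aborted. Only a committed xmax makes the row dead.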
References for Tuple and Transaction # Books:\nThe Internals of PostgreSQL PostgreSQL in Action PostgreSQL Internals: Deep Dive into Transaction Processing PostgreSQL Database Kernel Analysis https://edu.postgrespro.com/postgresql_internals-14_parts1-2_en.pdf\nOfficial resources:\nhttps://en.wikipedia.org/wiki/Concurrency_control\nhttps://wiki.postgresql.org/wiki/Hint_Bits\nhttps://www.postgresql.org/docs/current/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND\nhttps://www.postgresql.org/docs/10/storage-page-layout.html\nhttps://www.postgresql.org/docs/13/pageinspect.html\nEssential PostgreSQL transaction reads (interdb):\nhttps://www.interdb.jp/pg/pgsql05.html\nhttps://www.interdb.jp/pg/pgsql06.html\nSource code experts:\nhttps://blog.csdn.net/Hehuyi_In/article/details/102920988\nhttps://blog.csdn.net/Hehuyi_In/article/details/127955762\nhttps://blog.csdn.net/Hehuyi_In/article/details/125023923\nPostgreSQL snapshot optimization performance comparison:\nhttps://techcommunity.microsoft.com/t5/azure-database-for-postgresql/improving-postgres-connection-scalability-snapshots/ba-p/1806462\nOther resources:\nhttps://brandur.org/postgres-atomicity\nhttps://mp.weixin.qq.com/s/j-8uRuZDRf4mHIQR_ZKIEg\nSnapshots in PostgreSQL # A snapshot is a data structure that records the instantaneous state of the database. PostgreSQL\u0026rsquo;s snapshot stores: the minimum and maximum transaction IDs among all active transactions, the list of currently active transactions, the current transaction\u0026rsquo;s command ID, and more.\nSnapshot data is stored in the SnapshotData struct type. Source: src/include/utils/snapshot.h\ntypedef struct SnapshotData { SnapshotType snapshot_type; /* snapshot type */ TransactionId xmin;\t/* txid \u0026lt; xmin are visible to the snapshot */ TransactionId xmax;\t/* txid \u0026gt;= xmax are invisible to the snapshot */ /* list of active transactions at snapshot time. 
Only includes txids between xmin and xmax */ TransactionId *xip; uint32\txcnt;\t/* xip_list stored in xip[] */ /* list of active subtransactions at snapshot time */ TransactionId *subxip; int32\tsubxcnt;\t/* subtransactions stored in subxip[] */ bool\tsuboverflowed;\t/* whether subtransactions overflowed; overflows occur with many subtransactions */ bool\ttakenDuringRecovery;\t/* is this a recovery snapshot? */ bool\tcopied;\t/* whether the snapshot is a copy (RR and serializable copy their snapshots); false if static */ CommandId\tcurcid;\t/* command ID in the transaction; CID \u0026lt; curcid is visible */ ... TimestampTz whenTaken;\t/* timestamp when snapshot was taken */ XLogRecPtr\tlsn;\t/* LSN when snapshot was taken */ } SnapshotData; typedef struct SnapshotData *Snapshot; The most important snapshot information is xmin, xmax, and xip_list. Use pg_current_snapshot() (in pg12 and earlier, txid_current_snapshot()) to display the current transaction\u0026rsquo;s snapshot.\nNote: snapshot xmin/xmax are different from tuple xmin/xmax — they have different meanings.\nlzldb=*# select pg_current_snapshot(); pg_current_snapshot --------------------- 100:104:100,102 xmin Earliest active txid. All txids older than xmin have either committed (visible) or aborted (dead tuples). xmax First unassigned txid. xmax = latestCompletedXid + 1. All txid \u0026gt;= xmax have not yet started and are invisible to the current snapshot. xip_list Stored in array xip[]. Since transactions can start and finish out of order (a later-started transaction may finish earlier), xmin and xmax alone cannot fully express all active transactions at snapshot time. xip_list stores the active transactions at snapshot time. Snapshot Types # Beyond MVCC snapshots, PostgreSQL defines several other snapshot types in src/include/utils/snapshot.h:\ntypedef enum SnapshotType { /* Tuple is visible if and only if it satisfies MVCC snapshot visibility rules. 
* The most important snapshot type — used to implement MVCC. * Tuple visibility is judged based on snapshot xmin, xmax, xip_list, curcid, etc. * If a command changed data, the current MVCC snapshot won\u0026#39;t see it; a new MVCC snapshot is needed. */ SNAPSHOT_MVCC = 0, /* Tuple is visible if its transaction committed. * In-progress transactions are invisible. * Data changes from the current command are visible to the SELF snapshot. */ SNAPSHOT_SELF, /* * Any tuple is visible. */ SNAPSHOT_ANY, /* * Visible if the TOAST tuple is valid. TOAST visibility depends on the main table tuple\u0026#39;s visibility. */ SNAPSHOT_TOAST, /* * Data changes from the current command are visible to the DIRTY snapshot. * The DIRTY snapshot preserves version info for in-progress tuples. * Snapshot xmin is set to the xmin of other in-progress transactions\u0026#39; tuples; xmax is similar. */ SNAPSHOT_DIRTY, /* HISTORIC_MVCC snapshot follows MVCC rules, used for logical decoding. */ SNAPSHOT_HISTORIC_MVCC, /* Determines whether dead tuples are visible to certain transactions. */ SNAPSHOT_NON_VACUUMABLE } SnapshotType; Snapshots and Isolation Levels # Different isolation levels acquire snapshots differently:\nRead Committed requires a new snapshot for each SQL statement in the transaction, while Repeatable Read uses only one snapshot for the entire transaction. The function that acquires snapshots is GetTransactionSnapshot().\nProcess-Level Transaction Structures # When PostgreSQL acquires snapshot data, it needs to scan the transaction state of all backend processes.\nBefore understanding the GetSnapshotData() function, we need to understand several backend process structures: PGPROC, PGXACT, PROC_HDR (PROCGLOBAL), and ProcArray.\nThese process-related structures contain process and lock information. Here we only study the transaction-related parts. 
Source code examples are based on pg13.\nPGPROC Struct # Source: src/include/storage/proc.h\n// Every backend process stores a PGPROC struct in memory. // Think of this as the backend process\u0026#39;s main structure. struct PGPROC { ... LocalTransactionId lxid;\t/* local id of top-level transaction currently * being executed by this proc, if running; * else InvalidLocalTransactionId */ ... struct XidCache subxids;\t/* cached subtransaction XIDs */ ... /* clog group transaction status update */ bool\tclogGroupMember;\t/* whether this proc uses clog group commit */ pg_atomic_uint32 clogGroupNext; /* atomic int, pointing to the next group member proc */ TransactionId clogGroupMemberXid;\t/* xid to be committed */ XidStatus\tclogGroupMemberXidStatus;\t/* status of the xid to be committed */ int\tclogGroupMemberPage;\t/* which page the xid to be committed belongs to */ XLogRecPtr\tclogGroupMemberLsn; /* LSN of the commit log for the xid to be committed */ }; /* NOTE: \u0026#34;typedef struct PGPROC PGPROC\u0026#34; appears in storage/lock.h. Not written with the struct itself. */ PGXACT Struct # // Before 9.2, PGXACT information was inside PGPROC. Stress testing showed that on multi-CPU systems, // separating them makes GetSnapshotData faster by reducing the number of cache lines fetched. typedef struct PGXACT { TransactionId xid;\t/* id of top-level transaction currently being * executed by this proc, if running and XID * is assigned; else InvalidTransactionId */ // this backend\u0026#39;s own top-level xid (not an xmax) TransactionId xmin;\t/* excluding lazy vacuum; minimum xid at transaction start; vacuum cannot remove tuples with xid \u0026gt;= xmin */ uint8\tvacuumFlags;\t/* vacuum-related flags, see above */ bool\toverflowed; // whether this backend\u0026#39;s cached subxid list overflowed uint8\tnxids; // number of cached subxids } PGXACT; PGXACT stores relatively simple information: the backend\u0026rsquo;s xid, xmin, and other transaction-related fields. 
PGPROC leans toward storing basic backend info; some less frequently accessed transaction info remains in PGPROC, but the core process transaction info is in PGXACT.\nPROC_HDR (PROCGLOBAL) Struct # Every backend process has a proc struct. In high-concurrency scenarios, scanning all proc structs to find transaction info is time-consuming. An instance-level structure is needed to store all proc info — this is PROCGLOBAL.\nThe source typically uses the struct type PROC_HDR to define a struct pointer to PROCGLOBAL. PROC_HDR stores global proc info: the full array of proc structs, free procs, etc.\nSource: src/include/storage/proc.h\ntypedef struct PROC_HDR { /* pgproc array (not including dummies for prepared txns) */ PGPROC\t*allProcs; /* pgxact array (not including dummies for prepared txns) */ PGXACT\t*allPgXact; ... /* Current shared estimate of appropriate spins_per_delay value */ int\tspins_per_delay; /* The proc of the Startup process, since not in ProcArray */ PGPROC\t*startupProc; int\tstartupProcPid; /* Buffer id of the buffer that Startup process waits for pin on, or -1 */ int\tstartupBufferPinWaitBufId; } PROC_HDR; ProcArray Struct # ProcArray is in procarray.c, which maintains the PGPROC and PGXACT structures for all backends.\nSource: src/backend/storage/ipc/procarray.c\ntypedef struct ProcArrayStruct { int\tnumProcs;\t/* number of procs */ int\tmaxProcs;\t/* size of proc array */ // handling assigned xids int\tmaxKnownAssignedXids;\t/* allocated size of array */ int\tnumKnownAssignedXids;\t/* current # of valid entries */ int\ttailKnownAssignedXids;\t/* index of oldest valid element */ int\theadKnownAssignedXids;\t/* index of newest element, + 1 */ slock_t\tknown_assigned_xids_lck;\t/* protects head/tail pointers */ /* * Highest subxid that has been removed from KnownAssignedXids array to * prevent overflow; or InvalidTransactionId if none. We track this for * similar reasons to tracking overflowing cached subxids in PGXACT * entries. 
Must hold exclusive ProcArrayLock to change this, and shared * lock to read it. */ TransactionId lastOverflowedXid; /* oldest xmin of any replication slot */ TransactionId replication_slot_xmin; /* oldest catalog xmin of any replication slot */ TransactionId replication_slot_catalog_xmin; /* pgprocnos, equivalent to allPgXact[] array indices, used to look up allPgXact[]; this array has PROCARRAY_MAXPROCS entries */ int\tpgprocnos[FLEXIBLE_ARRAY_MEMBER]; } ProcArrayStruct; static ProcArrayStruct *procArray; Acquiring a Snapshot # GetTransactionSnapshot() # Snapshots are acquired via GetTransactionSnapshot().\nSource: src/backend/utils/time/snapmgr.c\n// GetTransactionSnapshot() allocates the appropriate snapshot for SQL in a transaction Snapshot GetTransactionSnapshot(void) { // Return historic snapshot if doing logical decoding. We\u0026#39;ll never need a // non-historic snapshot after this, so return directly. if (HistoricSnapshotActive()) { Assert(!FirstSnapshotSet); return HistoricSnapshot; } /* If this is the first call in this transaction, enter this if */ if (!FirstSnapshotSet) { /* * Ensure the catalog snapshot is fresh. 
*/ InvalidateCatalogSnapshot(); Assert(pairingheap_is_empty(\u0026amp;RegisteredSnapshots)); Assert(FirstXactSnapshot == NULL); // Return error if in parallel mode if (IsInParallelMode()) elog(ERROR, \u0026#34;cannot take query snapshot during a parallel operation\u0026#34;); // For Repeatable Read or Serializable, use the same snapshot for the entire transaction; only copy once // IsolationUsesXactSnapshot() means the isolation level is RR or Serializable — they use one snapshot per transaction if (IsolationUsesXactSnapshot()) { // First, create the snapshot in CurrentSnapshotData // If SI isolation level, initialize SSI-required data structures if (IsolationIsSerializable()) CurrentSnapshot = GetSerializableTransactionSnapshot(\u0026amp;CurrentSnapshotData); else CurrentSnapshot = GetSnapshotData(\u0026amp;CurrentSnapshotData); /* Make a saved copy */ /* For Repeatable Read or Serializable, this snapshot lasts the entire transaction; copy once */ CurrentSnapshot = CopySnapshot(CurrentSnapshot); FirstXactSnapshot = CurrentSnapshot; /* Mark it as \u0026#34;registered\u0026#34; in FirstXactSnapshot */ FirstXactSnapshot-\u0026gt;regd_count++; pairingheap_add(\u0026amp;RegisteredSnapshots, \u0026amp;FirstXactSnapshot-\u0026gt;ph_node); } else // For Read Committed, acquire a snapshot CurrentSnapshot = GetSnapshotData(\u0026amp;CurrentSnapshotData); // Modify flag to indicate this is the first snapshot; subsequent calls in this transaction won\u0026#39;t enter this if FirstSnapshotSet = true; return CurrentSnapshot; } // If not the first call in this transaction (already have a first snapshot) // For Repeatable Read or Serializable, return a copy of the first snapshot if (IsolationUsesXactSnapshot()) return CurrentSnapshot; /* Don\u0026#39;t allow catalog snapshot to be older than xact snapshot. 
*/ InvalidateCatalogSnapshot(); // Read Committed: re-acquire snapshot CurrentSnapshot = GetSnapshotData(\u0026amp;CurrentSnapshotData); return CurrentSnapshot; } About IsolationUsesXactSnapshot() and IsolationIsSerializable():\nDefined as macros in src/include/access/xact.h:\n#define XACT_READ_UNCOMMITTED\t0 #define XACT_READ_COMMITTED\t1 #define XACT_REPEATABLE_READ\t2 #define XACT_SERIALIZABLE\t3 // Internally only 3 isolation levels: 1, 2, 3 // 2 isolation levels use one snapshot per transaction; others use one snapshot per SQL statement #define IsolationUsesXactSnapshot() (XactIsoLevel \u0026gt;= XACT_REPEATABLE_READ) #define IsolationIsSerializable() (XactIsoLevel == XACT_SERIALIZABLE) IsolationUsesXactSnapshot() is true for Repeatable Read or Serializable.\nIsolationIsSerializable() is true for Serializable only.\nGetTransactionSnapshot() flow chart:\n(image from CSDN: https://blog.csdn.net/Hehuyi_In)\nThe main logic of GetTransactionSnapshot():\nFor historic snapshots during logical decoding, return the snapshot result directly. For Repeatable Read or Serializable: on the first call, return the snapshot and copy it so subsequent calls (non-first) can directly reference it. For Read Committed: generate a new snapshot on every call. For the first call in Serializable, additionally acquire SSI data information. GetTransactionSnapshot() acquires the snapshot; the actual data comes from GetSnapshotData(). GetSnapshotData() # Source: src/backend/storage/ipc/procarray.c\nSnapshot GetSnapshotData(Snapshot snapshot) { // Initialize some variables: arrayP pointer, procarray, xmin, xmax, replication slot txid, etc. 
ProcArrayStruct *arrayP = procArray; TransactionId xmin; TransactionId xmax; TransactionId globalxmin; int\tindex; int\tcount = 0; int\tsubcount = 0; bool\tsuboverflowed = false; TransactionId replication_slot_xmin = InvalidTransactionId; TransactionId replication_slot_catalog_xmin = InvalidTransactionId; Assert(snapshot != NULL); if (snapshot-\u0026gt;xip == NULL) { /* * First call for this snapshot. Snapshot is same size whether or not * we are in recovery, see later comments. */ snapshot-\u0026gt;xip = (TransactionId *) // get current transaction\u0026#39;s xip malloc(GetMaxSnapshotXidCount() * sizeof(TransactionId)); ... Assert(snapshot-\u0026gt;subxip == NULL); snapshot-\u0026gt;subxip = (TransactionId *) // get current subtransaction\u0026#39;s subxip malloc(GetMaxSnapshotSubxidCount() * sizeof(TransactionId)); ... } // Acquire procarray; need shared LWLock LWLockAcquire(ProcArrayLock, LW_SHARED); /* xmax = max completed xid + 1 */ xmax = ShmemVariableCache-\u0026gt;latestCompletedXid; Assert(TransactionIdIsNormal(xmax)); TransactionIdAdvance(xmax); // xmax + 1 /* xmax value retrieved; xmin needs scanning pgproc, pgxact, procarray */ /* Set globalxmin and xmin to xmax first; if backends have no transaction info, this is simpler */ globalxmin = xmin = xmax; // Recovery snapshots handled separately snapshot-\u0026gt;takenDuringRecovery = RecoveryInProgress(); // Non-recovery snapshots need transaction info from backends if (!snapshot-\u0026gt;takenDuringRecovery) { int\t*pgprocnos = arrayP-\u0026gt;pgprocnos; int\tnumProcs; /* * Spin over procArray checking xid, xmin, and subxids. The goal is * to gather all active xids, find the lowest xmin, and try to record * subxids. It appears that while scanning procarray, it will spin * to collect all active xids, the smallest xmin, and subtransaction subxids. 
*/ numProcs = arrayP-\u0026gt;numProcs; for (index = 0; index \u0026lt; numProcs; index++) { int\tpgprocno = pgprocnos[index]; // iterate numProcs, get all pgprocno indices PGXACT\t*pgxact = \u0026amp;allPgXact[pgprocno]; // iterate all pgxact structs via pgprocno TransactionId xid; ... /* Update globalxmin to be the smallest valid xmin */ xid = UINT32_ACCESS_ONCE(pgxact-\u0026gt;xmin); if (TransactionIdIsNormal(xid) \u0026amp;\u0026amp; NormalTransactionIdPrecedes(xid, globalxmin)) globalxmin = xid; /* Fetch xid just once - see GetNewTransactionId */ xid = UINT32_ACCESS_ONCE(pgxact-\u0026gt;xid); ... /* Save backend\u0026#39;s xid into snapshot xip */ /* i.e., iterate all pgxact to find all active xids */ snapshot-\u0026gt;xip[count++] = xid; ... /* Subtransaction info handling */ if (!suboverflowed) // if no backend\u0026#39;s subxid cache has overflowed so far { if (pgxact-\u0026gt;overflowed) suboverflowed = true; // this backend\u0026#39;s subxid cache overflowed, so mark the snapshot as suboverflowed else { int\tnxids = pgxact-\u0026gt;nxids; if (nxids \u0026gt; 0) { PGPROC\t*proc = \u0026amp;allProcs[pgprocno]; pg_read_barrier();\t/* pairs with GetNewTransactionId */ memcpy(snapshot-\u0026gt;subxip + subcount, (void *) proc-\u0026gt;subxids.xids, nxids * sizeof(TransactionId)); subcount += nxids; } } } } } else // the else corresponds to if (!snapshot-\u0026gt;takenDuringRecovery) { // These checks are for standby; when the instance is in hot standby mode and queries run on the replica subcount = KnownAssignedXidsGetAndSetXmin(snapshot-\u0026gt;subxip, \u0026amp;xmin, xmax); if (TransactionIdPrecedesOrEquals(xmin, procArray-\u0026gt;lastOverflowedXid)) suboverflowed = true; } // Replication slot xmin and catalog cluster-wide xmin, first save to local variables // Replication slot xmin prevents tuple reclamation // The comment says this is to avoid holding ProcArrayLock for too long, so save to local variables replication_slot_xmin = procArray-\u0026gt;replication_slot_xmin; 
replication_slot_catalog_xmin = procArray-\u0026gt;replication_slot_catalog_xmin; // Backend transaction info gathering is done; below is a series of ifs for cleanup and code robustness if (!TransactionIdIsValid(MyPgXact-\u0026gt;xmin)) MyPgXact-\u0026gt;xmin = TransactionXmin = xmin; LWLockRelease(ProcArrayLock); // release ProcArrayLock if (TransactionIdPrecedes(xmin, globalxmin)) globalxmin = xmin; // globalxmin and process xmin: assign globalxmin to the smaller one RecentGlobalXmin = globalxmin - vacuum_defer_cleanup_age; if (!TransactionIdIsNormal(RecentGlobalXmin)) RecentGlobalXmin = FirstNormalTransactionId; // edge case: if RecentGlobalXmin \u0026lt;= 2, assign 3 /* Check whether there\u0026#39;s a replication slot requiring an older xmin. */ if (TransactionIdIsValid(replication_slot_xmin) \u0026amp;\u0026amp; NormalTransactionIdPrecedes(replication_slot_xmin, RecentGlobalXmin)) RecentGlobalXmin = replication_slot_xmin; /* Non-catalog tables can be vacuumed if older than this xid */ RecentGlobalDataXmin = RecentGlobalXmin; // Re-check and compare catalog, globalxmin if (TransactionIdIsNormal(replication_slot_catalog_xmin) \u0026amp;\u0026amp; NormalTransactionIdPrecedes(replication_slot_catalog_xmin, RecentGlobalXmin)) RecentGlobalXmin = replication_slot_catalog_xmin; RecentXmin = xmin; // Start assigning values to the snapshot struct, returning snapshot data snapshot-\u0026gt;xmin = xmin; snapshot-\u0026gt;xmax = xmax; snapshot-\u0026gt;xcnt = count; snapshot-\u0026gt;subxcnt = subcount; snapshot-\u0026gt;suboverflowed = suboverflowed; snapshot-\u0026gt;curcid = GetCurrentCommandId(false); // If it\u0026#39;s a new snapshot, initialize some snapshot info snapshot-\u0026gt;active_count = 0; snapshot-\u0026gt;regd_count = 0; snapshot-\u0026gt;copied = false; // Snapshot-too-old logic below; oddly written here if (old_snapshot_threshold \u0026lt; 0) { /* * If not using \u0026#34;snapshot too old\u0026#34; feature, fill related fields with * dummy values that 
don\u0026#39;t require any locking. */ // When old_snapshot_threshold \u0026lt; 0 (no \u0026#34;snapshot too old\u0026#34; issue) // assign simple constant values that won\u0026#39;t require any locks snapshot-\u0026gt;lsn = InvalidXLogRecPtr; snapshot-\u0026gt;whenTaken = 0; } else { // When old_snapshot_threshold \u0026gt;= 0, need to handle old snapshot logic snapshot-\u0026gt;lsn = GetXLogInsertRecPtr(); // get LSN snapshot-\u0026gt;whenTaken = GetSnapshotCurrentTimestamp(); // get snapshot timestamp MaintainOldSnapshotTimeMapping(snapshot-\u0026gt;whenTaken, xmin); // // GetXLogInsertRecPtr(), GetSnapshotCurrentTimestamp(), MaintainOldSnapshotTimeMapping() // all contain SpinLockAcquire and SpinLockRelease // MaintainOldSnapshotTimeMapping() also has LWLockAcquire and LWLockRelease // Since this is called for every snapshot, GetSnapshotData should be very frequent // So in pg13 source, setting old_snapshot_threshold to negative avoids many spinlocks and lwlocks } return snapshot; } pg14 Snapshot Optimizations # pg14 Optimization Source Analysis # From the pg13 source, we can see that whenever old_snapshot_threshold \u0026gt;= 0, GetSnapshotData() runs the snapshot-too-old bookkeeping on every call, so each snapshot acquisition incurs several SpinLock and LWLock operations. Since snapshot acquisition is extremely frequent, this inevitably causes performance issues. 
So pg14 moved the old_snapshot_threshold logic out of the main body of GetSnapshotData().\nBeyond that change, pg14 made many other optimizations:\nRemoved RecentGlobalXmin and RecentGlobalDataXmin, added the GlobalVisTest* family of functions.\nIntroduced the boundaries concept with two boundaries: definitely_needed and maybe_needed:\nstruct GlobalVisState { /* XIDs \u0026gt;= are considered running by some backend */ // rows with XID \u0026gt;= definitely_needed are definitely still needed and cannot be removed FullTransactionId definitely_needed; /* XIDs \u0026lt; are not considered to be running by any backend */ // rows with XID \u0026lt; maybe_needed can definitely be cleaned up FullTransactionId maybe_needed; }; Added ComputeXidHorizons() for more precise horizon calculation (storing xmin and removable xid information). This function still needs to iterate PGPROC. The calculation range is XID \u0026gt;= maybe_needed \u0026amp;\u0026amp; XID \u0026lt; definitely_needed.\nAdded GlobalVisTestShouldUpdate() to determine whether boundaries need recalculation.\nFirst, understand the variable ComputeXidHorizonsResultLastXmin:\nstatic TransactionId ComputeXidHorizonsResultLastXmin; // last precisely computed xmin static bool GlobalVisTestShouldUpdate(GlobalVisState *state) { // If xmin=0, need to recalculate boundaries. This is an edge case for tuples created during database initialization. if (!TransactionIdIsValid(ComputeXidHorizonsResultLastXmin)) return true; /* * If the maybe_needed/definitely_needed boundaries are the same, it\u0026#39;s * unlikely to be beneficial to refresh boundaries. */ // When maybe_needed equals definitely_needed, no need to recalculate // Uses FullTransactionIdFollowsOrEquals (not strict equality) // \u0026#34;Greater than\u0026#34; scenario: no rows definitely visible. \u0026#34;Equal\u0026#34; scenario: only one row definitely visible. 
if (FullTransactionIdFollowsOrEquals(state-\u0026gt;maybe_needed, state-\u0026gt;definitely_needed)) return false; /* does the last snapshot built have a different xmin? */ // When the last snapshot\u0026#39;s xmin equals the last precisely computed xmin, no need to recalculate boundaries return RecentXmin != ComputeXidHorizonsResultLastXmin; } We can see that maybe_needed and definitely_needed are similar to snapshot xmin/xmax, but with an additional layer of computation. First calculate boundaries, then further refine with ComputeXidHorizons(). GlobalVisTestShouldUpdate reduces the scenarios where boundaries need recalculation, and ComputeXidHorizons() is also more efficient for precise calculation.\nOptimization Results # Recommended article on PostgreSQL snapshot optimization:\nhttps://techcommunity.microsoft.com/t5/azure-database-for-postgresql/improving-postgres-connection-scalability-snapshots/ba-p/1806462\nThe before-and-after comparison is striking:\nIn pg13 production environments, GetSnapshotData consistently shows high performance overhead. 
(No screenshot, so I\u0026rsquo;ll borrow another expert\u0026rsquo;s chart:)\nVisibility Checking # With a snapshot, we can determine tuple visibility. Let\u0026rsquo;s review the key information (ignoring subtransactions for now): tuple header transaction info, snapshot info, and CLOG transaction status (before SetHintBits).\nTuple header has: xmin, xmax, cmin, cmax, infomask, etc. Snapshot data has: snapshot xmin, xmax, xip_list, curcid, etc. CLOG has additional transaction status info, which may also be written to infomask as hint bits. 
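Before reading the real visibility code, it may help to see how the three snapshot fields (xmin, xmax, xip_list) combine into a single membership test. The following is a simplified sketch in the spirit of XidInMVCCSnapshot(), not the actual function: ToySnapshot and toy_xid_in_snapshot are hypothetical names, xids are compared as plain integers (the real code uses wraparound-aware comparisons), and subtransactions are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reduction of SnapshotData to the three core fields */
typedef struct ToySnapshot {
    uint32_t        xmin; /* every xid below this had finished */
    uint32_t        xmax; /* every xid at or above this had not started */
    const uint32_t *xip;  /* xids in [xmin, xmax) that were still running */
    size_t          xcnt; /* number of entries in xip */
} ToySnapshot;

/*
 * Returns true if the snapshot treats xid as still in progress,
 * meaning changes made by xid are invisible to this snapshot.
 */
bool toy_xid_in_snapshot(uint32_t xid, const ToySnapshot *snap)
{
    assert(snap != NULL && snap->xmin <= snap->xmax);

    if (xid < snap->xmin)
        return false; /* finished before the snapshot was taken */
    if (xid >= snap->xmax)
        return true;  /* started after the snapshot was taken */

    /* In between: in progress only if listed in xip */
    for (size_t i = 0; i < snap->xcnt; i++)
    {
        if (snap->xip[i] == xid)
            return true;
    }
    return false;
}
```

With the snapshot 100:104:100,102 shown earlier, xids 100 and 102 are reported as in progress because they appear in xip, 101 and 103 are not, anything below 100 is not, and 104 or higher is. A false result only means the transaction had finished; whether it committed or aborted still has to come from hint bits or CLOG.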
Different snapshot types have slightly different visibility rules:\nbool HeapTupleSatisfiesVisibility(HeapTuple tup, Snapshot snapshot, Buffer buffer) { switch (snapshot-\u0026gt;snapshot_type) { case SNAPSHOT_MVCC: return HeapTupleSatisfiesMVCC(tup, snapshot, buffer); break; ... case SNAPSHOT_NON_VACUUMABLE: return HeapTupleSatisfiesNonVacuumable(tup, snapshot, buffer); break; } ... } Each snapshot type has its own visibility rules. Here we\u0026rsquo;ll use the most common SNAPSHOT_MVCC visibility rules to understand tuple visibility.\nstatic bool HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot, Buffer buffer) { HeapTupleHeader tuple = htup-\u0026gt;t_data; Assert(ItemPointerIsValid(\u0026amp;htup-\u0026gt;t_self)); // lp valid, i.e., tuple valid Assert(htup-\u0026gt;t_tableOid != InvalidOid); // oid valid, i.e., table valid // t_xmin not committed: the transaction that INSERTed or UPDATEd this new tuple has not committed // In htup_details.h, macro: HeapTupleHeaderXminCommitted() is ((tup)-\u0026gt;t_infomask \u0026amp; HEAP_XMIN_COMMITTED) != 0 // So if (!HeapTupleHeaderXminCommitted(tuple)) means the tuple infomask does not have HEAP_XMIN_COMMITTED // Literally: t_xmin has not committed if (!HeapTupleHeaderXminCommitted(tuple)) { // If a transaction updated the tuple but then aborted or failed, this tuple\u0026#39;s xmin is the failed transaction ID // If the inserting transaction failed, directly return invisible if (HeapTupleHeaderXminInvalid(tuple)) return false; // When infomask has HEAP_MOVED_OFF, visibility is judged separately for VACUUM tuples, with hint bits set /* Used by pre-9.0 binary upgrades */ if (tuple-\u0026gt;t_infomask \u0026amp; HEAP_MOVED_OFF) { TransactionId xvac = HeapTupleHeaderGetXvac(tuple); if (TransactionIdIsCurrentTransactionId(xvac)) return false; if (!XidInMVCCSnapshot(xvac, snapshot)) { if (TransactionIdDidCommit(xvac)) { SetHintBits(tuple, buffer, HEAP_XMIN_INVALID, InvalidTransactionId); return false; } 
SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED, InvalidTransactionId); } } // When infomask has HEAP_MOVED_IN, visibility is judged separately for VACUUM tuples, with hint bits set /* Used by pre-9.0 binary upgrades */ else if (tuple-\u0026gt;t_infomask \u0026amp; HEAP_MOVED_IN) { TransactionId xvac = HeapTupleHeaderGetXvac(tuple); if (!TransactionIdIsCurrentTransactionId(xvac)) { if (XidInMVCCSnapshot(xvac, snapshot)) return false; if (TransactionIdDidCommit(xvac)) SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED, InvalidTransactionId); else { SetHintBits(tuple, buffer, HEAP_XMIN_INVALID, InvalidTransactionId); return false; } } } // When the tuple was written by the current transaction else if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmin(tuple))) { if (HeapTupleHeaderGetCmin(tuple) \u0026gt;= snapshot-\u0026gt;curcid) // tuple cid \u0026gt;= snapshot current command id return false;\t// tuple was inserted after visibility check started; invisible if (tuple-\u0026gt;t_infomask \u0026amp; HEAP_XMAX_INVALID) // infomask has HEAP_XMAX_INVALID return true; // tuple not deleted; visible // A pure insert, whether committed, not yet committed, or rolled back, has HEAP_XMAX_INVALID // But this check is under the \u0026#34;written by current transaction\u0026#34; condition, so: // Tuple inserted by current transaction, not committed (logically equivalent to not deleted within the same tx), // and t_cid \u0026lt; curcid → visible // xmax is set in two scenarios: 1) tuple locked, 2) tuple deleted // Even without HEAP_XMAX_INVALID, the tuple may not be deleted — it may just be locked // Locked tuples have xmax set but are visible if (HEAP_XMAX_IS_LOCKED_ONLY(tuple-\u0026gt;t_infomask))\t/* not deleter */ return true; // HEAP_XMAX_IS_MULTI is set when multiple transactions acquire locks on the same row, producing MultiXactId // Still judging visibility under xmax lock scenarios if (tuple-\u0026gt;t_infomask \u0026amp; HEAP_XMAX_IS_MULTI) { TransactionId xmax; 
xmax = HeapTupleGetUpdateXid(tuple); /* not LOCKED_ONLY, so it has to have an xmax */ Assert(TransactionIdIsValid(xmax)); /* updating subtransaction must have aborted */ // If xmax is not the current transaction, visible if (!TransactionIdIsCurrentTransactionId(xmax)) return true; // If xmax is the current transaction, judge by command id: // snapshot acquired before update/delete → tuple was visible at snapshot time else if (HeapTupleHeaderGetCmax(tuple) \u0026gt;= snapshot-\u0026gt;curcid) return true;\t/* updated after scan started */ else return false;\t/* updated before scan started */ } // The following scenario: a subtransaction\u0026#39;s delete command was rolled back, need SetHintBits HEAP_XMAX_INVALID // Delete rolled back, so tuple is visible if (!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmax(tuple))) { /* deleting subtransaction must have aborted */ SetHintBits(tuple, buffer, HEAP_XMAX_INVALID, InvalidTransactionId); return true; } // cmax is the command ID that deleted the tuple // If tuple cmax \u0026gt;= snapshot curcid: delete happened after snapshot scan → visible // If tuple cmax \u0026lt; snapshot curcid: delete happened before snapshot scan → invisible if (HeapTupleHeaderGetCmax(tuple) \u0026gt;= snapshot-\u0026gt;curcid) return true;\t/* deleted after scan started */ else return false;\t/* deleted before scan started */ } // XidInMVCCSnapshot() checks if xid was in-progress at snapshot time // \u0026#34;in-progress\u0026#34; means: 1. snapshot xmin \u0026lt;= xid \u0026lt; snapshot xmax AND xid in xip_list 2. xid \u0026gt;= snapshot xmax // The xid below is t_xmin // So this means: if t_xmin was in-progress at snapshot time → invisible // Equivalent to: t_xmin not committed → invisible. This seems redundant. // Because this whole block is under !HeapTupleHeaderXminCommitted(tuple) — also meaning t_xmin not committed. // But with the preceding checks, this else if is reasonable. 
Meaning: // t_xmin not committed, tuple not deleted, not current transaction → invisible else if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot)) return false; // If t_xmin transaction committed, SetHintBits HEAP_XMIN_COMMITTED // This seems odd: the entire block is for t_xmin NOT committed, how could it be committed here? // And if this case really happens, why no visibility judgment? else if (TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuple))) SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED, HeapTupleHeaderGetRawXmin(tuple)); // If t_xmin transaction did not commit, SetHintBits HEAP_XMIN_INVALID else { /* it must have aborted or crashed */ SetHintBits(tuple, buffer, HEAP_XMIN_INVALID, InvalidTransactionId); // t_xmin transaction not committed, return invisible again. Similar to XidInMVCCSnapshot() above? // Currently: not committed, and doesn\u0026#39;t satisfy XidInMVCCSnapshot() (xid was not in-progress at snapshot time) // The only case: transaction hadn\u0026#39;t started at snapshot time, later started, still not committed → invisible return false; } } // xmin-not-committed visibility judgments finally done // Everything after the else is for when xmin IS committed (hint bit HEAP_XMIN_COMMITTED is set) else { // xmin is committed, but maybe not according to our snapshot /* xmin is committed, but maybe not according to our snapshot */ // If infomask has no HEAP_XMIN_FROZEN AND xmin was in-progress at snapshot time → invisible // Translating the if: at snapshot time, xmin was not committed; at visibility check time, // tuple xmin is committed but not marked FROZEN → invisible // Even though tuple xmin is now committed, from the current snapshot\u0026#39;s perspective it was still in-progress if (!HeapTupleHeaderXminFrozen(tuple) \u0026amp;\u0026amp; XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot)) return false;\t/* treat as still in progress */ } // HEAP_XMAX_INVALID means tuple not deleted // This if means: tuple committed, 
and was committed at snapshot time, and not deleted (no delete marker at all) → visible if (tuple-\u0026gt;t_infomask \u0026amp; HEAP_XMAX_INVALID)\t/* xid invalid or aborted */ return true; // Tuple has xmax, but it\u0026#39;s not a delete — it\u0026#39;s a lock marker // This if means: tuple committed, was committed at snapshot time, has xmax but xmax is a lock → visible if (HEAP_XMAX_IS_LOCKED_ONLY(tuple-\u0026gt;t_infomask)) return true; // HEAP_XMAX_IS_MULTI means the tuple is in shared-row-lock state, typically when multiple transactions process one row if (tuple-\u0026gt;t_infomask \u0026amp; HEAP_XMAX_IS_MULTI) { TransactionId xmax; /* already checked above */ Assert(!HEAP_XMAX_IS_LOCKED_ONLY(tuple-\u0026gt;t_infomask)); // Get the transaction ID that updated the tuple xmax = HeapTupleGetUpdateXid(tuple); /* not LOCKED_ONLY, so it has to have an xmax */ Assert(TransactionIdIsValid(xmax)); // If the shared-row-lock tuple\u0026#39;s transaction ID is the current transaction if (TransactionIdIsCurrentTransactionId(xmax)) { // tuple cmax \u0026gt;= snapshot curcid: tuple not yet deleted at snapshot time → visible if (HeapTupleHeaderGetCmax(tuple) \u0026gt;= snapshot-\u0026gt;curcid) return true;\t/* deleted after scan started */ // tuple cmax \u0026lt; snapshot curcid: tuple already deleted at snapshot time → invisible else return false;\t/* deleted before scan started */ } // If the shared-row-lock tuple\u0026#39;s transaction ID is not the current transaction, and xmax was in-progress at snapshot time // This if means: xmin committed, tuple not deleted, MULTI XMAX marker present, xmax not yet committed at snapshot time → visible if (XidInMVCCSnapshot(xmax, snapshot)) return true; // If the shared-row-lock tuple transaction committed → invisible if (TransactionIdDidCommit(xmax)) return false;\t/* updating transaction committed */ /* it must have aborted or crashed */ // Updating transaction aborted or crashed → still visible return true; } // Tuple xmin 
committed, xmax not yet marked committed, not yet deleted // Seems !HEAP_XMAX_COMMITTED differs from HEAP_XMAX_INVALID // This looks like: tuple experienced a delete, but the delete transaction hasn\u0026#39;t committed // While HEAP_XMAX_INVALID above is: definitely no delete or delete aborted/rolled back, so can directly return true if (!(tuple-\u0026gt;t_infomask \u0026amp; HEAP_XMAX_COMMITTED)) { // If xmax is the same as the checking transaction if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmax(tuple))) { // Same old pattern: visibility via command id // cmax \u0026gt;= snapshot curcid: delete happened after snapshot → visible if (HeapTupleHeaderGetCmax(tuple) \u0026gt;= snapshot-\u0026gt;curcid) return true;\t/* deleted after scan started */ // cmax \u0026lt; snapshot curcid: delete happened before snapshot → invisible else return false;\t/* deleted before scan started */ } // Delete transaction not committed, and xmax not the checking transaction // If xmax was in-progress at snapshot time → visible if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmax(tuple), snapshot)) return true; // Confirm xmax delete transaction aborted or failed; SetHintBits HEAP_XMAX_INVALID // Similar to HEAP_XMAX_INVALID above → visible if (!TransactionIdDidCommit(HeapTupleHeaderGetRawXmax(tuple))) { /* it must have aborted or crashed */ SetHintBits(tuple, buffer, HEAP_XMAX_INVALID, InvalidTransactionId); return true; } /* xmax transaction committed */ // Remaining case: xmax delete transaction committed. 
SetHintBits HEAP_XMAX_COMMITTED // Visibility should be judged here, but it\u0026#39;s deferred to the last few lines, because this is a sub-case of a larger condition SetHintBits(tuple, buffer, HEAP_XMAX_COMMITTED, HeapTupleHeaderGetRawXmax(tuple)); } else { /* xmax is committed, but maybe not according to our snapshot */ // xmax delete transaction now committed, but was in-progress at snapshot time → visible if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmax(tuple), snapshot)) return true;\t/* treat as still in progress */ } /* xmax transaction committed */ // Only remaining case: xmax committed and was not in-progress at snapshot time → invisible return false; } The entire visibility judgment source code looks complex. Stripping out the SetHintBits parts and the convoluted if-else chains, focusing only on the core visibility rules, the key points are:\nCore visibility rule logic:\nDelete committed → tuple invisible Insert committed, delete rolled back → tuple visible Insert committed, delete not committed → current transaction compares cid; other transactions see the tuple as visible Insert rolled back → tuple invisible Insert not committed → same transaction compares cmin; other transactions see the tuple as invisible Visibility checking involves two time points: the check time and the snapshot time. The logic distinguishes between the same transaction (checking transaction = snapshot transaction) and different transactions.\nSame transaction: compare tuple cmin/cmax against snapshot-\u0026gt;curcid.\ncmin \u0026gt;= snapshot-\u0026gt;curcid: tuple inserted after snapshot → invisible. Otherwise visible. cmax \u0026gt;= snapshot-\u0026gt;curcid: tuple deleted after snapshot → visible. Otherwise invisible. Different transactions: use XidInMVCCSnapshot() to check whether xid (t_xmin or t_xmax) was in-progress at snapshot time.\nxmin was in-progress at snapshot time → invisible. xmax was in-progress at snapshot time → visible. 
Beyond basic DML operations, there are 4 additional cases:\nVACUUM tuple insert/delete visibility Lock-only marker (HEAP_XMAX_IS_LOCKED_ONLY): tuple visible MultiXact state (HEAP_XMAX_IS_MULTI): visibility for tuples under multi-transaction locks Frozen tuples: visibility when frozen marker is set MultiXact # What Is MultiXact? # When multiple transactions lock the same row, there may be multiple associated transaction IDs on the tuple. PostgreSQL groups multiple transaction IDs together and manages them with a single MultiXactId. The relationship between TransactionId and MultiXactId is many-to-one.\nLike TransactionId, MultiXactId is also 32-bit and also subject to wraparound.\nMultiXactId 0 is reserved as the invalid value (InvalidMultiXactId). Allocatable MultiXactIds start from 1 (FirstMultiXactId).\nSource: src/include/access/multixact.h #define InvalidMultiXactId\t((MultiXactId) 0) #define FirstMultiXactId\t((MultiXactId) 1) #define MaxMultiXactId\t((MultiXactId) 0xFFFFFFFF) Row Lock Types # MultiXact only exists when rows are locked. 
MultiXact defines 6 states:\ntypedef enum { MultiXactStatusForKeyShare = 0x00, MultiXactStatusForShare = 0x01, MultiXactStatusForNoKeyUpdate = 0x02, MultiXactStatusForUpdate = 0x03, /* an update that doesn\u0026#39;t touch \u0026#34;key\u0026#34; columns */ MultiXactStatusNoKeyUpdate = 0x04, /* other updates, and delete */ MultiXactStatusUpdate = 0x05 } MultiXactStatus; There are 4 explicitly declarable row lock states: ForKeyShare, ForShare, ForNoKeyUpdate, ForUpdate.\nMultiXact Infomask Flags # PostgreSQL marks row locks on xmax and records them in infomask.\nSource: src/include/access/htup_details.h\n#define HEAP_XMAX_KEYSHR_LOCK\t0x0010\t/* xmax is a key-shared locker */ #define HEAP_XMAX_EXCL_LOCK\t0x0040\t/* xmax is exclusive locker */ #define HEAP_XMAX_LOCK_ONLY\t0x0080\t/* xmax, if valid, is only a locker */ #define HEAP_XMAX_SHR_LOCK\t(HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_KEYSHR_LOCK) #define HEAP_LOCK_MASK\t(HEAP_XMAX_SHR_LOCK | HEAP_XMAX_EXCL_LOCK | \\ HEAP_XMAX_KEYSHR_LOCK) #define HEAP_XMAX_IS_MULTI\t0x1000\t/* t_xmax is a MultiXactId */ Here we focus on the HEAP_XMAX_IS_MULTI flag. 
Only when multiple transactions hold shared locks on the same row is a true MultiXact ID generated and this flag set.\nlzldb=# insert into lzl1 values(1); -- initially one row INSERT 0 1 lzldb=# select * from vlzl1; t_ctid | lp | lp_flags | t_xmin | t_xmax | t_cid | raw_flags | combined_flags --------+----+-----------+--------+--------+-------+----------------------------------+---------------- (0,1) | 1 | LP_NORMAL | 742 | 0 | 0 | {HEAP_HASNULL,HEAP_XMAX_INVALID} | {} (1 row) Session 1 Session 2 lzldb=# begin; BEGIN lzldb=*# select * from lzl1 for share; a --- 1 lzldb=# begin; BEGIN lzldb=*# select * from lzl1 for share;\na --- 1 lzldb=*# update lzl1 set a=2; -- hang commit; UPDATE 1 -- update completed -- Check tuple xmax and infomask lzldb=*# select t_ctid,lp,t_xmin,t_xmax,(t_infomask\u0026amp;4096)!=0 is_multixact from heap_page_items(get_raw_page(\u0026#39;lzl1\u0026#39;,0)); t_ctid | lp | t_xmin | t_xmax | is_multixact --------+----+--------+--------+-------------- (0,2) | 1 | 742 | 4 | t (0,2) | 2 | 744 | 3 | t HEAP_XMAX_IS_MULTI is 0x1000 in hex, which is 4096 in decimal. Using (t_infomask\u0026amp;4096)!=0 is_multixact shows whether the tuple uses a MultiXact ID. From the example:\nMultiXact IDs have their own value space, separate from transaction IDs. MultiXact IDs are generally smaller than transaction IDs; here t_xmax \u0026lt; t_xmin. For an UPDATE, old and new tuples typically share the same xmax. In MultiXact scenarios, they may differ. 
MultiXact SLRU # Although src/backend/access/transam/multixact.c defines many variables and functions at the top (page, member, membergroup, offset), they all boil down to constants and the conversion functions between them.\nBefore reading multixact.c, understand a few macros:\nsrc/include/c.h defines MultiXactOffset as a 32-bit type:\ntypedef uint32 MultiXactOffset; src/include/access/slru.h defines how many SLRU pages per segment:\n#define SLRU_PAGES_PER_SEGMENT\t32 Back to the top of src/backend/access/transam/multixact.c:\n#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset)) // MULTIXACT_OFFSETS_PER_PAGE = 8192 bytes / 4 bytes (32 bits) = 2048. One page stores 2048 offset entries, i.e., entries for 2048 MultiXactIds. #define MultiXactIdToOffsetPage(xid) \\ ((xid) / (MultiXactOffset) MULTIXACT_OFFSETS_PER_PAGE) // Convert xid to the page where the corresponding record resides: xid / 2048 #define MultiXactIdToOffsetEntry(xid) \\ ((xid) % (MultiXactOffset) MULTIXACT_OFFSETS_PER_PAGE) // Convert xid to the offset within the page: xid % 2048 #define MultiXactIdToOffsetSegment(xid) (MultiXactIdToOffsetPage(xid) / SLRU_PAGES_PER_SEGMENT) // Convert xid to the segment: xid / 2048 / 32 Now read the comments at the top of the source:\n/* * Defines for MultiXactOffset page sizes. A page is the same BLCKSZ as is * used everywhere else in Postgres. * * Note: because MultiXactOffsets are 32 bits and wrap around at 0xFFFFFFFF, * MultiXact page numbering also wraps around at * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE, and segment numbering at * 0xFFFFFFFF/MULTIXACT_OFFSETS_PER_PAGE/SLRU_PAGES_PER_SEGMENT. We need * take no explicit notice of that fact in this module, except when comparing * segment and page numbers in TruncateMultiXact (see * MultiXactOffsetPagePrecedes). 
*/ Since MultiXactOffsets are 32-bit and subject to wraparound:\nMultiXact page numbering wraps at 0xFFFFFFFF / MULTIXACT_OFFSETS_PER_PAGE = 2^32 / 2048 = 2^21 Segment numbering wraps at 0xFFFFFFFF / MULTIXACT_OFFSETS_PER_PAGE / SLRU_PAGES_PER_SEGMENT = 2^32 / 2^11 / 2^5 = 2^16 TruncateMultiXact() cleans up these segments and page numbers. It is called by VACUUM.\nThe pg_multixact Directory # Like CLOG and SUBTRANS, MultiXact logs use an SLRU buffer pool implementation. The pg_multixact directory has only two subdirectories: members and offsets.\n[pg@lzl pg_multixact]$ ll total 8 drwx------ 2 pg pg 4096 Feb 14 21:29 members drwx------ 2 pg pg 4096 Feb 14 21:29 offsets One MultiXactId corresponds to multiple TransactionIds — the members. The offset is the starting position of each MultiXact.\ntypedef struct mXactCacheEnt { MultiXactId multi; // one MultiXactId int\tnmembers; dlist_node\tnode; MultiXactMember members[FLEXIBLE_ARRAY_MEMBER]; // multiple TransactionIds; expanded via MultiXactIdExpand() if needed } mXactCacheEnt; multixact.h defines MultiXactMember as just a single transaction ID and its status:\ntypedef struct MultiXactMember { TransactionId xid; MultiXactStatus status; } MultiXactMember; MultiXact References # https://www.postgresql.org/docs/current/routine-vacuuming.html\nhttps://pgpedia.info/m/multixact-id.html\nhttps://www.postgresql.org/docs/15/explicit-locking.html\nhttps://www.modb.pro/db/14939\nhttps://www.highgo.ca/2020/06/12/transactions-in-postgresql-and-their-mechanism/\nTwo-Phase Commit (2PC) Transactions # What Is a 2PC Transaction? # Transaction atomicity requires that a transaction either completes entirely or rolls back entirely. In distributed transactions spanning multiple connected databases, a consistent state must be provided to satisfy distributed transaction atomicity. 
Like other databases, PostgreSQL provides the Two-Phase Commit Protocol (2PC).\nThere are many distributed transaction implementations; 2PC is the most fundamental and common. Distributed transactions encompass atomic commit, atomic visibility, and global consistency. 2PC is only an implementation for atomic commit.\nPREPARE TRANSACTION # Foreign Data Wrappers (FDWs) can handle 2PC internally. PostgreSQL also provides an explicit way to use 2PC: PREPARE TRANSACTION. Once issued, the prepared transaction is detached from the session; its state is persisted. PREPARE TRANSACTION is not designed for use in applications or interactive sessions — unless you\u0026rsquo;re writing a transaction manager — so it is recommended (and default) to keep it disabled.\nSyntax:\nPREPARE TRANSACTION transaction_id COMMIT PREPARED transaction_id ROLLBACK PREPARED transaction_id Notes:\nThe transaction_id here is not the internal transaction ID — it\u0026rsquo;s just a user-declared string. PREPARE TRANSACTION must be inside a transaction block, started with BEGIN or START TRANSACTION. max_prepared_transactions controls the number of prepared transactions. Default is 0 (disabled). Must be increased to use prepared transactions. Starting a Prepared Transaction # lzldb=# begin; BEGIN lzldb=*# PREPARE TRANSACTION \u0026#39;lzl\u0026#39;; PREPARE TRANSACTION lzldb=# select * from pg_prepared_xacts ; transaction | gid | prepared | owner | database -------------+-----+-------------------------------+-------+---------- 719 | lzl | 2023-04-29 16:08:45.866022+08 | pg | lzldb (1 row) lzldb=# rollback prepared \u0026#39;lzl\u0026#39;; ROLLBACK PREPARED lzldb=# select * from pg_prepared_xacts ; transaction | gid | prepared | owner | database -------------+-----+----------+-------+---------- (0 rows) The pg_twophase Directory # As mentioned, prepared transactions are session-independent. When a prepared transaction is started, its state information is stored in a cache. 
To ensure the transaction is not lost, prepared transactions are also persisted to the pg_twophase directory. This doesn\u0026rsquo;t only happen on shutdown — it\u0026rsquo;s tied to checkpoint.\nSource: src/backend/access/transam/twophase.c\nvoid CheckPointTwoPhase(XLogRecPtr redo_horizon) { ... TRACE_POSTGRESQL_TWOPHASE_CHECKPOINT_START(); // checkpoint start ... fsync_fname(TWOPHASE_DIR, true); // call fsync to flush to disk TRACE_POSTGRESQL_TWOPHASE_CHECKPOINT_DONE(); // checkpoint done ... } Let\u0026rsquo;s test: start a prepared transaction and run a checkpoint:\n[pg@lzl pg_twophase]$ ll total 0 lzldb=*# PREPARE TRANSACTION \u0026#39;lzl\u0026#39;; PREPARE TRANSACTION lzldb=# checkpoint; CHECKPOINT [pg@lzl pg_twophase]$ ll total 4 -rw------- 1 pg pg 116 Apr 29 16:33 000002D0 Orphaned Prepared Transactions # If a prepared transaction is never completed (neither committed nor rolled back), and since it is session-independent, it will persist unless explicitly terminated. (Normally, a regular transaction rolls back when the session disconnects.) This is an orphaned prepared transaction.\nOrphaned prepared transactions hold locks and tuple resources indefinitely, preventing VACUUM from reclaiming dead tuples and even blocking transaction ID wraparound. For example, if a prepared transaction is forgotten and not committed or rolled back, and there is no external transaction management monitoring it, it may go unnoticed and exist forever — ultimately causing severe problems. 
Therefore, it\u0026rsquo;s recommended to keep max_prepared_transactions=0 (default) or monitor prepared transactions via the pg_prepared_xacts view.\nHere\u0026rsquo;s a simulation of an orphaned prepared transaction causing indefinite blocking:\n-- Start a prepared transaction and disconnect lzldb=# begin; BEGIN lzldb=*# insert into lzl1 values(1); INSERT 0 1 lzldb=*# PREPARE TRANSACTION \u0026#39;lzl\u0026#39;; PREPARE TRANSACTION lzldb=# \\q -- After disconnecting, the prepared transaction still exists postgres=# select * from pg_prepared_xacts ; transaction | gid | prepared | owner | database -------------+-----+-------------------------------+-------+---------- 721 | lzl | 2023-04-29 17:08:59.597678+08 | pg | lzldb -- DDL blocked lzldb=# alter table lzl1 add column b int; -- Check locks lzldb=# select locktype,relation,pid,mode from pg_locks where relation=32808; locktype | relation | pid | mode ----------+----------+-------+--------------------- relation | 32808 | 26136 | AccessExclusiveLock relation | 32808 | | RowExclusiveLock -- End the prepared transaction; DDL completes lzldb=# rollback prepared \u0026#39;lzl\u0026#39;; ROLLBACK PREPARED lzldb=# alter table lzl1 add column b int; ALTER TABLE 2PC Transaction References # http://postgres.cn/docs/13/sql-prepare-transaction.html\nhttps://www.highgo.ca/2020/01/28/understanding-prepared-transactions-and-handling-the-orphans/\nhttps://wiki.postgresql.org/wiki/Atomic_Commit_of_Distributed_Transactions\nSubtransactions # What Is a Subtransaction? # A regular transaction can only commit or roll back as a whole. Subtransactions allow partial rollback.\nSAVEPOINT p1 places a savepoint marker inside a transaction. You cannot directly commit a subtransaction — subtransactions are committed when the parent transaction commits. However, you can use ROLLBACK TO SAVEPOINT p1 to roll back to that savepoint.\nSubtransactions are useful for bulk data loading. 
If a transaction contains multiple subtransactions and one small segment fails, only that segment needs to be retried — not the entire transaction.\nUsing Subtransactions in SQL # SAVEPOINT savepoint_name ROLLBACK [ WORK | TRANSACTION ] TO [ SAVEPOINT ] savepoint_name RELEASE [ SAVEPOINT ] savepoint_name Notes:\nSavepoint statements must be inside a transaction block. SAVEPOINT creates a savepoint; ROLLBACK TO rolls back to the named savepoint; RELEASE erases the savepoint without rolling back subtransaction data. Cursors are not affected by savepoint operations. Example:\nlzldb=# begin; BEGIN lzldb=*# insert into lzl1 values(0); INSERT 0 1 lzldb=*# savepoint p1; SAVEPOINT lzldb=*# insert into lzl1 values(1); INSERT 0 1 lzldb=*# savepoint p2; SAVEPOINT lzldb=*# insert into lzl1 values(2); INSERT 0 1 lzldb=*# savepoint p3; SAVEPOINT lzldb=*# insert into lzl1 values(3); INSERT 0 1 lzldb=*# rollback to savepoint p2; ROLLBACK lzldb=*# commit; COMMIT lzldb=# select xmin,xmax,cmin,a from lzl1; xmin | xmax | cmin | a ------+------+------+--- 731 | 0 | 0 | 0 732 | 0 | 1 | 1 (2 rows) -- Rolling back to p2 also rolled back p3 lzldb=# select * from vlzl1; t_ctid | lp | lp_flags | t_xmin | t_xmax | t_cid | raw_flags | combined_flags --------+----+-----------+--------+--------+-------+------------------------------------------------------+---------------- (0,1) | 1 | LP_NORMAL | 731 | 0 | 0 | {HEAP_HASNULL,HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {} (0,2) | 2 | LP_NORMAL | 732 | 0 | 1 | {HEAP_HASNULL,HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {} (0,3) | 3 | LP_NORMAL | 733 | 0 | 2 | {HEAP_HASNULL,HEAP_XMIN_INVALID,HEAP_XMAX_INVALID} | {} (0,4) | 4 | LP_NORMAL | 734 | 0 | 3 | {HEAP_HASNULL,HEAP_XMIN_INVALID,HEAP_XMAX_INVALID} | {} (4 rows) -- Subtransaction infomask is not very different from regular transactions. -- Multiple commands within the same transaction are differentiated by cid and HEAP_XMIN_INVALID, etc. 
-- Subtransaction writes also consume transaction IDs, and cid increments within the parent transaction framework. Other Sources of Subtransactions # Even without explicit SAVEPOINT, subtransactions can be created by other means:\nEXCEPTION blocks trigger subtransactions. This is common in tools and frameworks and easily overlooked. Every PL/pgSQL block with an EXCEPTION clause runs in its own subtransaction.\nSyntax: BEGIN / EXCEPTION WHEN .. / END\nReference: https://fluca1978.github.io/2020/02/05/PLPGSQLExceptions.html\nPL/Python code using plpy.subtransaction().\nSubtransaction SLRU Cache # Subtransaction commit logs are in pg_xact. Parent-child relationships are stored in pg_subtrans, which caches the mapping of subXID to parent XID. When PostgreSQL needs to look up a subXID, it calculates which memory page the ID resides on and searches within that page. If the page is not in cache, it evicts a page and loads the required page from pg_subtrans into memory. Large numbers of subtransaction cache misses consume system I/O and CPU.\nThe subtransaction buffer is only 32 pages, hardcoded in the source.\nSource: src/include/access/subtrans.h\n/* Number of SLRU buffers to use for subtrans */ #define NUM_SUBTRANS_BUFFERS 32 The default page size (BLCKSZ) is 8KB; an xid is 32 bits (4 bytes). 
Therefore:\nSUBTRANS_BUFFER size: 32 * 8K = 256KB SUBTRANS_BUFFER can store at most: 32 * 8K / 4 = 65,536 xids Finding a subtransaction\u0026rsquo;s position in a page by transaction ID:\nSource: src/backend/access/transam/subtrans.c\n/* We need four bytes per xact */ #define SUBTRANS_XACTS_PER_PAGE (BLCKSZ / sizeof(TransactionId)) // Each page can store up to 8K / 4 bytes = 2048 subtransaction IDs #define TransactionIdToPage(xid) ((xid) / (TransactionId) SUBTRANS_XACTS_PER_PAGE) // Calculate page number from subtransaction xid: xid / 2048 #define TransactionIdToEntry(xid) ((xid) % (TransactionId) SUBTRANS_XACTS_PER_PAGE) // Calculate offset within page from subtransaction xid: xid % 2048 Subtransaction xids may not be densely packed within a page — a page may hold fewer than 2048 subtransaction IDs.\nThe Dangers of Subtransactions # 1. PGPROC_MAX_CACHED_SUBXIDS Overflow\nPGPROC_MAX_CACHED_SUBXIDS is not a GUC parameter — it\u0026rsquo;s hardcoded. You can only change it by modifying the source.\nSource: src/include/storage/proc.h\n/* *Each backend has a subtransaction cache limit of PGPROC_MAX_CACHED_SUBXIDS. *We must track whether the cache has overflowed (i.e., the transaction has at least one subtransaction that couldn\u0026#39;t be cached). *If no cache has overflowed, we can be sure that an xid not in the PGPROC array is definitely not a running transaction. *If there is an overflow, we must consult pg_subtrans. */ #define PGPROC_MAX_CACHED_SUBXIDS 64\t/* XXX guessed-at value */ struct XidCache { TransactionId xids[PGPROC_MAX_CACHED_SUBXIDS]; }; Two key takeaways from this source:\nEvery backend\u0026rsquo;s subtransaction cache is capped at PGPROC_MAX_CACHED_SUBXIDS: 64 subtransactions. Beyond 64 subtransactions, they overflow to the pg_subtrans directory. An expert\u0026rsquo;s benchmark: performance drops when subtransactions just exceed 64. 
So it\u0026rsquo;s best to keep per-session subtransactions below 64.\nReference: https://postgres.ai/blog/20210831-postgresql-subtransactions-considered-harmful\n2. Subtransactions Causing MultiXact Contention\nReference: https://buttondown.email/nelhage/archive/notes-on-some-postgresql-implementation-details/\nFOR UPDATE itself is a row-level exclusive lock and should not generate a MultiXact ID. But in this scenario, several MultiXact wait events appeared, causing a cliff-like performance drop:\nLWLock:MultiXactMemberControlLock LWLock:MultiXactOffsetControlLock LWLock:multixact_member LWLock:multixact_offset It was later discovered that the Django framework was issuing subtransaction statements:\nSELECT [some row] FOR UPDATE; SAVEPOINT save; UPDATE [the same row]; Because the UPDATE runs inside a subtransaction with its own XID, the row ends up carrying two XIDs (the locker and the updater), which forces PostgreSQL to create a MultiXact. 3. Replica Performance Cliff\nReference: https://about.gitlab.com/blog/2021/09/29/why-we-spent-the-last-month-eliminating-postgresql-subtransactions/\nA single long transaction with a savepoint subtransaction can also cause a performance cliff on replicas.\nIf a read occurs on a snapshot taken on the primary, the snapshot includes xmin, xmax, the xip transaction list, and subxip (the list of in-progress subtransactions). However, neither the original arrays nor the snapshot are directly shared with replicas; replicas read all needed data from WAL.\nWhen subtransactions exist, a single long-running transaction can cause replica performance to drop off a cliff.\n4. Production Performance Cliff\nWhen the database is busy and many subtransactions exist, performance can drop sharply, accompanied by subtransaction wait events. This scenario can occur even when per-session subtransactions don\u0026rsquo;t exceed 64, and even on the primary (not just replicas).\nWe found that a tool (OGG) defaulted to 50 subtransactions. 
Reducing the subtransaction count in that tool to 10–20 alleviated the database performance issue.\nSubtransaction usage recommendations:\nBesides explicit SAVEPOINT, EXCEPTION blocks, frameworks, and tools can also generate subtransactions. If you have replica query workloads, disable subtransactions. Use row locks cautiously. FOR UPDATE + subtransactions can also trigger MultiXactId issues. If you must use subtransactions, keep them well below 64 per session — preferably much lower. Subtransactions have caused countless production issues worldwide, with many case studies and analyses. To quote: \u0026ldquo;Subtransactions are basically cursed. Rip \u0026rsquo;em out.\u0026rdquo;\nSubtransaction References # https://postgres.ai/blog/20210831-postgresql-subtransactions-considered-harmful\nhttps://www.cybertec-postgresql.com/en/subtransactions-and-performance-in-postgresql/\nhttps://fluca1978.github.io/2020/02/05/PLPGSQLExceptions.html\nhttps://about.gitlab.com/blog/2021/09/29/why-we-spent-the-last-month-eliminating-postgresql-subtransactions/\nhttps://buttondown.email/nelhage/archive/notes-on-some-postgresql-implementation-details/\nReferences # Books:\nThe Internals of PostgreSQL PostgreSQL in Action PostgreSQL Internals: Deep Dive into Transaction Processing PostgreSQL Database Kernel Analysis https://edu.postgrespro.com/postgresql_internals-14_parts1-2_en.pdf\nOfficial resources:\nhttps://en.wikipedia.org/wiki/Concurrency_control\nhttps://wiki.postgresql.org/wiki/Hint_Bits\nhttps://www.postgresql.org/docs/current/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND\nhttps://www.postgresql.org/docs/10/storage-page-layout.html\nhttps://www.postgresql.org/docs/13/pageinspect.html3\nEssential PostgreSQL transaction reads (interdb):\nhttps://www.interdb.jp/pg/pgsql05.html\nhttps://www.interdb.jp/pg/pgsql06.html\nSource code 
experts:\nhttps://blog.csdn.net/Hehuyi_In/article/details/102920988\nhttps://blog.csdn.net/Hehuyi_In/article/details/127955762\nhttps://blog.csdn.net/Hehuyi_In/article/details/125023923\nPostgreSQL snapshot optimization performance comparison:\nhttps://techcommunity.microsoft.com/t5/azure-database-for-postgresql/improving-postgres-connection-scalability-snapshots/ba-p/1806462\nOther resources:\nhttps://brandur.org/postgres-atomicity\nhttps://mp.weixin.qq.com/s/j-8uRuZDRf4mHIQR_ZKIEg\nhttps://blog.csdn.net/postgrechina/article/details/49130743?spm=a2c6h.12873639.article-detail.7.41b32cda2KR1QM\nhttp://mysql.taobao.org/monthly/2018/12/02/\nOriginally published in Chinese on lastdba.com.\n","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/2024/08/12/a-deep-dive-into-postgresql-transactions/","section":"Posts","summary":"PostgreSQL Transactions\nTo guarantee ACID properties, an RDBMS must implement concurrency control. PostgreSQL, like Oracle and MySQL (InnoDB), uses MVCC (Multi-Version Concurrency Control) for concurrency control. MVCC works by continuously generating new versions of objects as data changes while allowing queries to access a bounded range of older versions. It captures a snapshot of data at a given point in time and selects one version to read.\nOracle and MySQL both use undo segments to record old versions of objects. PostgreSQL has no undo. Instead, during DML operations it writes historical data directly into the original table (UPDATE creates a new row, DELETE marks the row) and records additional columns — xmin and xmax — in the table to store transaction IDs. 
By comparing transaction IDs and other metadata, PostgreSQL implements its MVCC mechanism.\n","title":"A Deep Dive into PostgreSQL Transactions","type":"posts"},{"content":" Process Memory Analysis # \u0026#34;WAL writer process (PID 66902) was terminated by signal 6: Aborted\u0026#34;,,,,,,,,,\u0026#34;\u0026#34;,\u0026#34;postmaster\u0026#34; The log entry, written by the postmaster, shows that WAL writer process 66902 was terminated by signal 6 (SIGABRT).\nChecking OS-level process memory: since top doesn\u0026rsquo;t show PPID and ps doesn\u0026rsquo;t show USS, we need both:\nUSER PID PPID PRI %CPU %MEM VSZ RSS WCHAN S STARTED TIME COMMAND postgres 211276 66478 19 8.7 10.6 57488380 56389972 - R 17:13:03 00:02:47 postgres: BIND postgres 211277 66478 19 7.8 9.6 52294700 51127480 - R 17:13:03 00:02:31 postgres: BIND postgres 222749 66478 19 22.7 9.3 51320000 49073368 - R 17:35:33 00:02:09 postgres: BIND postgres 39513 66478 19 2.9 6.8 38651084 36354736 ep_poll S 16:13:03 00:02:43 postgres: idle The shared PPID (66478) identifies these high-memory processes as backends of the same postmaster. Let\u0026rsquo;s examine process 211276:\n[postgres@lzl]$ zcat /osw/oswtop/toposw.dat.gz |grep 211276 PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 211276 postgres 20 0 3271756 1.1g 1.1g S 7.3 0.2 0:03.93 postgres 211276 postgres 20 0 3291784 1.3g 1.2g R 96.4 0.2 0:11.87 postgres 211276 postgres 20 0 7369628 6.0g 2.1g R 100.0 1.2 0:46.58 postgres 211276 postgres 20 0 17.0g 15.9g 2.1g R 100.0 3.2 1:16.70 postgres 211276 postgres 20 0 28.8g 27.7g 2.1g R 100.0 5.5 1:46.82 postgres 211276 postgres 20 0 41.4g 40.4g 2.1g R 100.0 8.0 2:16.99 postgres 211276 postgres 20 0 54.7g 53.7g 2.1g R 88.8 10.7 2:47.60 postgres 211276 postgres 20 0 66.5g 64.9g 2.1g R 34.7 12.9 3:22.76 postgres 211276 postgres 20 0 71.0g 68.2g 2.1g R 99.1 13.6 3:52.94 postgres 211276 postgres 20 0 74.9g 71.2g 2.1g R 100.0 14.2 4:23.05 postgres 211276 postgres 20 0 0 0 0 R 100.0 0.0 4:45.65 postgres We can estimate private memory (USS) as roughly RES - SHR. 
Process 211276\u0026rsquo;s memory ballooned from ~1GB to ~70GB within minutes, then crashed. All memory growth was private process memory.\nSQL Analysis # The PostgreSQL log shows a 5MB SQL containing 5,000+ UNION ALLs and 30,000+ bind variables.\nThe execution plan is over 70,000 lines long:\nAppend (cost=218196.51..218216.28 rows=1318 width=1628) InitPlan 1 (returns $0) -\u0026gt; Index Scan using table1 on table1nfo (cost=0.29..5.31 rows=1 width=40) Index Cond: ((col1)::text = \u0026#39;xxx\u0026#39;::text) Filter: ((colcolcol)::text = \u0026#39;xxx\u0026#39;::text) InitPlan 2 (returns $1) -\u0026gt; Index Scan using table1 on table1nfo table1nfo_1 (cost=0.29..5.31 rows=1 width=40) Index Cond: ((col1)::text = \u0026#39;xxx\u0026#39;::text) Filter: ((colcolcol)::text = \u0026#39;xxx\u0026#39;::text) ... InitPlan 10544 (returns $10543) -\u0026gt; Aggregate (cost=5.58..5.59 rows=1 width=32) -\u0026gt; Index Scan using table2 on table2col t_1317 (cost=0.56..5.58 rows=1 width=19) Index Cond: ((ididid)::text = \u0026#39;xxx\u0026#39;::text) Filter: ((idididid)::text = \u0026#39;1\u0026#39;::text) The plan structure is simple: ~10,000 sub-plans fetching data, then an Append to combine results.\nThis SQL monstrosity pushed a single backend process to 70GB. The root cause is clear: reduce the UNION ALLs and the problem goes away (which is indeed what happened). But if we dig deeper, many interesting questions arise:\nWhy did a 5MB SQL consume 70GB of memory? Is the data itself related to memory usage? Was it caused by returning too many rows? Is the memory from parsing cache or plan cache? Why didn\u0026rsquo;t work_mem limit the operation memory, even though it\u0026rsquo;s set to a reasonable value? Initial Analysis # A 5MB SQL cached in a backend would at minimum contain: metadata, parsed SQL, and plan cache information.\nWe\u0026rsquo;ve seen cases before where metadata cache (relcache) for hundreds of thousands of tables/partitions caused huge backend memory. 
But this database has few tables, so relcache can be preliminarily ruled out (later confirmed by memory dump).\nParsed SQL data shouldn\u0026rsquo;t be that large: parsing a 5MB SQL shouldn\u0026rsquo;t produce 70GB.\nwork_mem limitations and more:\nwork_mem only limits per-operation memory for sort and hash operations. This creates the \u0026ldquo;multiple sort/hash\u0026rdquo; problem: a single SQL with many sorts can use work_mem × N. PG 13 introduced hash_mem_multiplier, which sets the limit for each hash-based operation to work_mem × hash_mem_multiplier; it is still a per-operation limit, not a per-statement cap. But what about sorts? There is currently no multiplier for sorts, though in practice it\u0026rsquo;s less of a problem: statements with dozens of sort nodes are rare, as they carry high cost, and the optimizer tends to place sorts late in the plan.\nHere, work_mem is 128MB and the instance is PG 13+ with hash_mem_multiplier=1, so mass hash memory consumption can be ruled out. Furthermore, the execution plan above has zero sort or hash operations, confirming this is not a sort/hash issue.\nSo the earlier question: \u0026ldquo;Why didn\u0026rsquo;t work_mem limit operation memory?\u0026rdquo;\nBecause the SQL only has UNION ALL, with no sort or hash operations at all. work_mem does not constrain memory here.\nOther plan nodes:\nNo matter what, work_mem only (!) limits sort/hash. There are dozens of plan node types. Are the rest all unconstrained?\nReproduction and Deep Analysis # Empty Table Reproduction # --Create empty table create table lzl1(col1 varchar(1)); --Query with many UNION ALLs select col1 from lzl1 union all select col1 from lzl1 union all ...(5000 UNION ALLs, SQL size 150KB) select col1 from lzl1 (Too many UNION ALLs may exceed max_stack_depth)\nAn empty table + many UNION ALLs immediately reproduces the memory spike. Moreover, after the SQL completes, the backend memory is reclaimed.\nSince this is an empty table (0KB data file), we can rule out data as the cause. So: \u0026ldquo;Is the data itself related to memory? 
Was it caused by returning too many rows?\u0026rdquo; — No, data is not the main factor.\nStrace System Call Analysis # While executing the SQL, capture system calls with strace -p:\nstrace -p 198337 \u0026gt; strace.198337 2\u0026gt;\u0026amp;1 Quick primer on relevant Linux syscalls:\nepoll_wait: Wait for an event. Idle processes sit in this state. recvfrom: Receive a message from a socket. getrusage: Get resource usage. brk: Program break. Increasing it allocates memory to the process; decreasing it deallocates. malloc ultimately calls brk. lseek: Reposition file offset. write: Write to a file descriptor. Does not guarantee disk write. sendto: Send a message on a socket. Syscalls like lseek, write, sendto include fd (file descriptor) information:\nlseek(37, 0, SEEK_END) = 0 /proc/[pid]/fd caches the process\u0026rsquo;s file descriptors. We can map an fd back to a relation — fd 37 is table lzl1:\n[postgres@lzl]$ cd /proc/198337/fd [postgres@lzl]$ ll 37 lrwx------ 1 postgres postgres 64 Jan 26 22:59 37 -\u0026gt; /pgdata/lzl/data13/base/16385/16386 [postgres@lzl]$ oid2name -d lzldb -f 16386 From database \u0026#34;lzldb\u0026#34;: Filenode Table Name ---------------------- 16386 lzl1 The strace output is dense but structurally simple:\nstrace: Process 198337 attached epoll_wait(4, [{EPOLLIN, {u32=44314568, u64=44314568}}], 1, -1) = 1 ## step1 recvfrom(9, \u0026#34;Q\\0\\2p\\372select col1 from lzl1 union\u0026#34;..., 8192, 0, NULL, NULL) = 8192 recvfrom(9, \u0026#34; all\\nselect col1 from lzl1 union\u0026#34;..., 8192, 0, NULL, NULL) = 8192 recvfrom(9, \u0026#34; all\\nselect col1 from lzl1 union\u0026#34;..., 8192, 0, NULL, NULL) = 8192 ... recvfrom(9, \u0026#34; all\\nselect col1 from lzl1 union\u0026#34;..., 8192, 0, NULL, NULL) = 8192 recvfrom(9, \u0026#34; all\\nselect col1 from lzl1 union\u0026#34;..., 8192, 0, NULL, NULL) = 4347 ## step2 brk(NULL) = 0x34d5000 brk(0x3cd5000) = 0x3cd5000 brk(NULL) = 0x3cd5000 ... 
brk(NULL) = 0x88cd6000 brk(0x894d6000) = 0x894d6000 ## step3 lseek(37, 0, SEEK_END) = 0 lseek(37, 0, SEEK_END) = 0 ... lseek(37, 0, SEEK_END) = 0 ## step4 brk(NULL) = 0x89cd6000 brk(0x8a4d6000) = 0x8a4d6000 brk(NULL) = 0x8a4d6000 ... brk(NULL) = 0x8a516000 brk(0x8a556000) = 0x8a556000 ## step5 write(2, \u0026#34;2024-01-26 23:08:01.800 CST [198\u0026#34;..., 165521) = 165521 brk(NULL) = 0x8a556000 brk(0x8a57d000) = 0x8a57d000 brk(NULL) = 0x8a57d000 brk(0x8a59f000) = 0x8a59f000 ... brk(NULL) = 0x8d449000 brk(0x8d46b000) = 0x8d46b000 brk(NULL) = 0x8d46b000 brk(0x8d48d000) = 0x8d48d000 #step6 lseek(37, 0, SEEK_END) = 0 lseek(37, 0, SEEK_END) = 0 ... lseek(37, 0, SEEK_END) = 0 #step7 brk(NULL) = 0x8dcb1000 brk(NULL) = 0x8dcb1000 brk(0x8c179000) = 0x8c179000 brk(NULL) = 0x8c179000 brk(NULL) = 0x8c179000 brk(NULL) = 0x8c179000 brk(0x8a526000) = 0x8a526000 ... brk(0x34d5000) = 0x34d5000 brk(NULL) = 0x34d5000 #step8 sendto(8, \u0026#34;\\2\\0\\0\\0\\230\\0\\0\\0\\1@\\0\\0\\1\\0\\0\\0\\1\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\u0026#34;..., 152, 0, NULL, 0) = 152 sendto(9, \u0026#34;T\\0\\0\\0\\35\\0\\1col1\\0\\0\\0\\0\\0\\0\\0\\0\\0\\4\\23\\377\\377\\0\\0\\0\\5\\0\\0C\\0\u0026#34;..., 50, 0, NULL, 0) = 50 #step9 recvfrom(9, 0xddcf60, 8192, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable) epoll_wait(4, strace: Process 198337 detached \u0026lt;detached ...\u0026gt; Receive the UNION ALL SQL from fd=9 socket brk allocates memory: process memory grows from 0x34d5000 (54MB) to 0x894d6000 (2.1GB) lseek on table lzl1 Memory grows ~4MB write to fd=2 (log file); memory grows ~48MB lseek on table lzl1 Memory peaks at 0x8dcb1000 (2.1GB), then brk releases memory back down to 0x34d5000 (54MB) — exactly matching the start Send result via socket Receive empty message from fd=9 The strace doesn\u0026rsquo;t reveal much beyond the OS allocating and releasing memory for the process.\nMemory Dump Analysis # pmap of the process during the memory spike:\n[postgres@lzl 
pg_log]$ pmap -x 76207 76207: postgres: postgres lzldb [local] SELECT Address Kbytes RSS Dirty Mode Mapping 0000000000400000 7984 2192 0 r-x-- postgres 0000000000dcc000 4 4 4 r---- postgres 0000000000dcd000 60 60 60 rw--- postgres 0000000000ddc000 200 60 60 rw--- [ anon ] 0000000001e49000 264 224 224 rw--- [ anon ] 0000000001e8b000 1812380 1804400 1804400 rw--- [ anon ] ... ffffffffff600000 4 0 0 r-x-- [ anon ] ---------------- ------- ------- ------- total kB 2089384 1810232 1807384 pmap doesn\u0026rsquo;t label the segments, but we can see the largest segment starts at address 0x1e8b000, directly after a small anonymous segment at 0x1e49000. Checking smaps for more detail:\n[postgres@lzl 76207]$ cat smaps |grep 1e49000 -A 30 01e49000-01e8b000 rw-p 00000000 00:00 0 [heap] Size: 264 kB ... 01e8b000-70872000 rw-p 00000000 00:00 0 [heap] Size: 1812380 kB Rss: 1804400 kB Pss: 1804400 kB Shared_Clean: 0 kB Shared_Dirty: 0 kB Private_Clean: 0 kB Private_Dirty: 1804400 kB Referenced: 1804400 kB Anonymous: 1804400 kB AnonHugePages: 0 kB Swap: 0 kB KernelPageSize: 4 kB MMUPageSize: 4 kB Both are heap segments, and the large one is entirely private memory: Pss and Private_Dirty are both 1.8GB, with nothing shared!\n(I tried using gdb to dump the 0x1e8b000-0x70872000 segment but it failed, not sure why. 
Suggestions welcome!)\nUsing gcore for a rough dump:\n[postgres@lzl lzl]$ gcore -o /pgdata/lzl/gcore.dump 76207 [postgres@lzl lzl]$ strings gcore.dump.76207\u0026gt; text.dump.76207 [postgres@lzl lzl]$ ll -h -rw-r----- 1 postgres postgres 2.0G Jan 26 17:29 gcore.dump.76207 -rw-r----- 1 postgres postgres 5.2M Jan 26 17:30 text.dump.76207 2GB virtual memory allocated, 1.8GB physical memory occupied — but only 5.2MB of actual data stored!\nA rough hexdump reveals many memory holes:\n[postgres@lzl lzl]$ hexdump -C gcore.dump.76207 |head -10000 |grep \u0026#34;00 00 00 00 00 00 00 00\u0026#34;|wc -l 3690 log_planner_stats and Other Info # To verify whether the plan cache is the culprit, enable logging for parse, planner, and executor phases:\nlog_parser_stats = on log_planner_stats = on log_executor_stats = on The logs show the parse phase uses little memory, while the planner consumes significantly more.\nPlanner stats log:\n2024-01-26 18:01:41.592 CST [208503] LOG: PLANNER STATISTICS 2024-01-26 18:01:41.592 CST [208503] DETAIL: ! system usage stats: ! 0.048955 s user, 0.004996 s system, 0.054077 s elapsed ! [11.208034 s user, 1.313838 s system total] ! 2255352 kB max resident size ! 0/0 [0/352] filesystem blocks in/out ! 0/1315 [0/563859] page faults/reclaims, 0 [0] swaps ! 0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent ! 0/0 [1/16] voluntary/involuntary context switches 2GB max resident size — consistent with the RES observed from the OS. This answers: \u0026ldquo;Is the memory from parsing cache or plan cache?\u0026rdquo; — The planner phase consumes the memory.\nInspecting TopMemoryContext # PostgreSQL manages backend private memory through MemoryContext. 
We can dump TopMemoryContext via gdb:\nTopMemoryContext: 101488 total in 6 blocks; 48464 free (28 chunks); 53024 used pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 1408 free (0 chunks); 6784 used TopTransactionContext: 8192 total in 1 blocks; 7720 free (0 chunks); 472 used TableSpace cache: 8192 total in 1 blocks; 2048 free (0 chunks); 6144 used RowDescriptionContext: 8192 total in 1 blocks; 6880 free (0 chunks); 1312 used MessageContext: 1854981336 total in 235 blocks; 7911304 free (9 chunks); 1847070032 used ... Grand total: 1856104056 bytes in 431 blocks; 8226712 free (179 chunks); 1847877344 used MessageContext accounts for 1.8GB — the largest consumer.\nFrom src/backend/utils/mmgr/README:\nMessageContext \u0026mdash; this context holds the current command message from the frontend, as well as any derived storage that need only live as long as the current message (for example, in simple-Query mode the parse and plan trees can live here). This context will be reset, and any children deleted, at the top of each cycle of the outer loop of PostgresMain. This is kept separate from per-transaction and per-portal contexts because a query string might need to live either a longer or shorter time than any single transaction or portal.\nWhen creating a prepared statement, the parse and plan trees will be built in a temporary context that\u0026rsquo;s a child of MessageContext.\nMessageContext caches messages from the frontend, including derived parse and plan tree data. Parse and plan trees are children of MessageContext — when MessageContext is reclaimed, parse and plan trees are reclaimed too. This also explains the private memory reclamation: the plan tree data produced during the planner phase is a child of MessageContext. Once results are returned, MessageContext is reset and all children are freed. 
This matches the strace observation where memory after release matches memory before allocation exactly.\nSummary # Answering the final question: \u0026ldquo;Why did a 5MB SQL consume 70GB of memory?\u0026rdquo;\nThe overwhelming majority of memory was consumed during plan creation. The planner allocated enormous amounts of memory. work_mem and hash_mem_multiplier can only constrain sort and hash operations; they cannot limit other memory operations during planning. The plan tree itself isn\u0026rsquo;t that large, but the allocation process creates massive memory holes: megabyte-scale data (metadata, parse tree, plan tree, etc.) ends up stored in gigabyte-scale memory regions.\nThe SQL text, parse tree, and plan tree all live in MessageContext and its children. Once the result is sent back to the client, all memory from this phase is reclaimed.\n","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/2024/08/12/analyzing-a-5mb-sql-that-consumed-70gb-of-memory/","section":"Posts","summary":"Process Memory Analysis # \"WAL writer process (PID 66902) was terminated by signal 6: Aborted\",,,,,,,,,\"\",\"postmaster\" The log entry, written by the postmaster, shows that WAL writer process 66902 was terminated by signal 6 (SIGABRT).\nChecking OS-level process memory: since top doesn’t show PPID and ps doesn’t show USS, we need both:\nUSER PID PPID PRI %CPU %MEM VSZ RSS WCHAN S STARTED TIME COMMAND postgres 211276 66478 19 8.7 10.6 57488380 56389972 - R 17:13:03 00:02:47 postgres: BIND postgres 211277 66478 19 7.8 9.6 52294700 51127480 - R 17:13:03 00:02:31 postgres: BIND postgres 222749 66478 19 22.7 9.3 51320000 49073368 - R 17:35:33 00:02:09 postgres: BIND postgres 39513 66478 19 2.9 6.8 38651084 36354736 ep_poll S 16:13:03 00:02:43 postgres: idle The shared PPID (66478) identifies these high-memory processes as backends of the same postmaster. Let’s examine process 211276:\n","title":"Analyzing a 5MB SQL That Consumed 70GB of Memory","type":"posts"},{"content":"Arthur C. Clarke's masterpiece, a work no sci-fi fan can afford to skip. 
I'd long heard of its reputation, but having already seen the film adaptation, I felt it lacked some novelty, so the book just sat on my shelf unread. But after reading it, I can say with complete confidence: every page is filled with freshness — the kind of dopamine-driven reading that makes it impossible to put down. God-Tier Predictions # This book was published in the 1960s — more than 60 years ago from now (2023). What is science fiction? Sci-fi makes reasonably plausible predictions about the future based on current science. And the author, living in the 1960s, imagined humanity\u0026rsquo;s space exploration in the year 2000. We, living in the present, are perfectly positioned to verify his \u0026ldquo;future world.\u0026rdquo;\nOf course, these prophetic predictions aren\u0026rsquo;t perfectly accurate. For example, his forecast for manned space travel is clearly a bit too optimistic. After the Apollo program ended, we\u0026rsquo;ve never again undertaken a practice that breaks free from Earth\u0026rsquo;s bounds — not even returning to the Moon\u0026hellip;\nIn the novel\u0026rsquo;s year 2000, humanity already has a luxurious Moon base and dispatches astronauts aboard a spacecraft bound for Jupiter.\nBut you can\u0026rsquo;t really blame the author. The book was published in 1968, and the very next year, humans landed on the Moon. Given a few more decades, landing on Jupiter should\u0026rsquo;ve been feasible, right?\nThe novel contains many astonishing predictions — here are a few that left a deep impression:\nPopulation: Arthur C. Clarke predicted with stunning accuracy that the global population would explode to 6 billion by 2000 (in the 1960s it was 3 billion). He even foresaw certain countries implementing birth control due to overpopulation, limiting families to two children. 
(Clearly conservative, right\u0026hellip; The Celestial Empire had already started family planning, and only one child was allowed — until young people stopped wanting children altogether.)\nPandemic control: In the year 2000, a global pandemic spreads, with quarantine zones set up everywhere\u0026hellip; (I have no f***ing words.)\nArtificial intelligence: In 1946, von Neumann invented the computer — the concept was just emerging — yet Arthur C. Clarke was already emphasizing the concept of artificial intelligence, predicting AI\u0026rsquo;s control over vast, complex systems. Even more remarkably, he had already imagined AI potentially rebelling against humans\u0026hellip; ChatGPT was only recognized this year. The more you think about it, the more chilling it gets~\nTablet computers: Home computers didn\u0026rsquo;t appear until the 1980s, yet in the novel, people are already using tablet computers to control system inputs and read the news\u0026hellip; Because the novel is so hardcore, Clarke even describes switching between a news homepage and category pages on a tablet, with data analysis delivering content tailored to the user\u0026hellip;\nTriple-site mirroring: As a DBA, I\u0026rsquo;m hyper-sensitive to this term. The author describes data center mirror backups, with data split into three identical copies stored in different locations on Earth for disaster recovery\u0026hellip; I\u0026rsquo;m not entirely sure when concepts like \u0026ldquo;two-site-three-center\u0026rdquo; or \u0026ldquo;three-site-five-center\u0026rdquo; were first proposed (though I imagine not long ago), but seeing the novel describe data mirroring and remote disaster recovery in such detail genuinely struck a chord with my DBA instincts.\nReading this masterpiece, my state of mind was: shock, then more shock, then nonstop shock~ How did Arthur C. Clarke, in the 1960s, conceive of this future world? Unimaginable. No wonder some people say: \u0026ldquo;Arthur C. 
Clarke time-traveled to the present, then went back to the 1960s to write this work.\u0026rdquo;\nImagination # If it were merely scientific prediction, it couldn't truly be called science fiction. Sci-fi can't just be cold scientific extrapolation — it needs a touch of humanistic distillation, a bit of imagination that departs from science, like Liu Cixin's portrayals of human nature. This element of imagination beyond science is precisely what ultimately determines a sci-fi work's stature. And the ultimate imaginative conceit of *2001: A Space Odyssey* is the TMA-1 monolith and the Star Child. The TMA-1 monolith is an alien artifact that catalyzes human evolution, and it simultaneously represents the vast gap between human science and alien science. The entire novel revolves around this monolith — it is the very core of the entire sci-fi story. In fact, the monolith only appears at two points in time: the ape-men era and the beginning of humanity's space exploration. When the ape-men first encounter the monolith, their physical structure undergoes subtle changes — their hands become more dexterous, their brains begin to think. The author then uses several chapters to describe the ape-men's transformation: 1. This group of ape-men masters tools. In a confrontation with a leopard, for the first time in history, they gain the upper hand — marking the first time they stand at the top of the food chain, no longer prey. 2. This group of ape-men decisively triumphs in a struggle against another group of apes — marking their transformation from ape-men into humans. Then, the novel leaps over millions of years of human history, cutting directly to the era of space travel. This technique is utterly brilliant~ The second time: a lone human, after countless hardships, reaches the monolith on Saturn (Jupiter in the film). 
The protagonist passes through a wormhole pre-arranged by the alien beings, experiences a journey through space, witnesses many wondrous cosmic spectacles, and finally falls into a room — the Star Child is born! The alien life guided ape-men to become humans, then guided humans to become the Star Child. The Star Child is pure imagination — built on the analogy of ape-men becoming humans, marked by the TMA-1 monolith. Imaginative elements are added perfectly and naturally, leaving a profound, lingering aftertaste. Worthy of being a seminal work in science fiction. Old Liu (Liu Cixin) # I read quite a few of Liu Cixin's works during university — *The Three-Body Problem*, *The Wandering Earth*, *Ball Lightning*, *Earth Cannon*... I really like *The Three-Body Problem*, but I have no interest in the excessive factional disputes in the first book — I even found them a bit contrived. However, the concept of understanding Trisolaran society through the Three-Body game is brilliantly executed. *The Dark Forest* is clearly much better — arguably the most thrilling book in the trilogy. Back when I finished these works, I had a feeling *The Wandering Earth* might be adapted into a film; the others seemed harder to film... Liu Cixin's sci-fi works feature strong narrative suspense and abundant human conflict, focusing more on human behavior against a cosmic backdrop. Arthur C. Clarke's works, by contrast, rarely dwell on interpersonal relationships. He prefers depicting the face of future society and the bizarre wonders of stars, planets, and space travel. Many parts of Liu Cixin's work clearly show the influence of *Space Odyssey*. When Clarke describes TMA-1, he uses the word \u0026quot;smooth\u0026quot; — clearly the \u0026quot;droplet\u0026quot; in *The Three-Body Problem* references this concept. 
Both are technological products of alien civilizations beyond human comprehension, though their purposes are vastly different~ Speaking of which, Liu Cixin hasn't released a new work in over a decade — what's he up to... The Film — 2001: A Space Odyssey # Released in 1968, another masterpiece by Kubrick — the god of sci-fi meets the god of cinema. That iconic BGM swells~ When the ape-man throws the bone — the tool — into the sky, and as it falls, the shot cuts to millions of years later... An exquisitely brilliant piece of cinematic language, truly stirring~ When I first watched this film, there were many parts I didn't fully understand. After reading the novel, everything falls into place. The film also adds many classic scenes, such as: 1. The depiction of Earth's orbital space in the year 2000. After over 30 years of development, humanity has launched countless capsules into space — the sky is filled with all manner of spacecraft. This sequence was frequently referenced before the year 2000. 2. HAL 9000 reading the astronauts' lips and learning they plan to shut him down. I assumed this scene was in the novel, but the book's portrayal of taking down the AI is far more circuitous. Both are brilliant, though. (The film *The Wandering Earth*'s MOSS pays heavy homage to HAL 9000.) Closing # *Space Odyssey* perfectly embodies what hard sci-fi should be: god-tier predictions about the future, paired with a finishing touch of pure imagination. I read this book far too late — I absolutely must read the sequels soon! ​\n","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/2024/08/12/book-notes-2001-a-space-odyssey/","section":"Posts","summary":"​ ​​\nArthur C. Clarke's masterpiece — a work no sci-fi fan can afford to skip. I'd long heard of its reputation, but having already seen the film adaptation, I felt it lacked some novelty, so the book just sat on my shelf unread. 
But after reading it, I can say with complete confidence: every page is filled with freshness — the kind of dopamine-driven reading that makes it impossible to put down. God-Tier Predictions # This book was published in the 1960s — more than 60 years ago from now (2023). What is science fiction? Sci-fi makes reasonably plausible predictions about the future based on current science. And the author, living in the 1960s, imagined humanity’s space exploration in the year 2000. We, living in the present, are perfectly positioned to verify his “future world.”\n","title":"Book Notes — 2001: A Space Odyssey","type":"posts"},{"content":" Preface # My previous book was Wild — the Pacific Crest Trail queen mentioned this book, Are We Smart Enough to Know How Smart Animals Are?, noting how she\u0026rsquo;d read it page by page, tearing each one out after reading. I wonder: as she journeyed through mountains and forests, hearing birdsong and streams, did reading this book about how clever animals are feel especially resonant?\nI\u0026rsquo;d previously read Sapiens (I can\u0026rsquo;t help recommending this book — it\u0026rsquo;s incredible). That book starts from when humans first stood upright and traces our journey until we gradually became gods\u0026hellip; What exactly makes humans different — what allows us to stand out from the myriad of living creatures?\nThe author, Frans de Waal, is an expert in primate behavior — the most cutting-edge and popular field within all animal behavior studies. Especially as experimental methods have improved, we\u0026rsquo;ve discovered that those traits humans keep proudly claiming as uniquely ours have all been found in other animal groups.\nApes # This book is highly scientific, containing extensive descriptions of experiments, observations, and the development of biological science. Since it\u0026rsquo;s science, let\u0026rsquo;s learn something~ When you see the word \u0026ldquo;ape,\u0026rdquo; what kind of ape image comes to mind? 
Whatever it is, it\u0026rsquo;s not precise enough. Because \u0026ldquo;ape\u0026rdquo; is a general term, you can roughly divide apes into four types: chimpanzees, gorillas, orangutans, and gibbons (bonobos are likely a branch of chimpanzees, frequently mentioned in the book; I\u0026rsquo;ll set them aside for simplicity):\nHomo sapiens, chimpanzee, gorilla, orangutan, gibbon (Hylobates)\nKinship:\nHominoidea means the family of \u0026ldquo;hominoids\u0026rdquo;, and yes, all these close relatives of ours belong to the hominid family! The other Homixxx entries are smaller tribal branches. From the family tree above, we can see that we Homo sapiens are most closely related to chimpanzees, with gorillas, orangutans, and gibbons increasingly distant.\nEvolutionary timeline: About six million years ago, we and chimpanzees were still the same species\u0026hellip; Chimpanzees are also universally recognized as the most intelligent animals. Did we really evolve from monkeys? This description isn\u0026rsquo;t quite accurate. Although the diagram above doesn\u0026rsquo;t mark monkeys, going further back we certainly share a common ancestor. But that doesn\u0026rsquo;t mean we evolved from monkeys: as with chimpanzees, we and monkeys share a common ancestor that is now extinct. So we didn\u0026rsquo;t evolve from monkeys; we and monkeys are two different branches from a common ancestor. \u0026ldquo;Although for convenience we often use \u0026lsquo;animals\u0026rsquo; to refer to non-human species, it\u0026rsquo;s undeniable that humans are a kind of animal.\u0026rdquo;\nWhat Makes Us Different? # Tool use?\nAfter reading Space Odyssey, I thought what made humans human was our learning to use tools. From the moment we grasped tools in our hands to crack open bone marrow, to humanity venturing into space to explore the unknown: all because we learned to use tools. 
But we can easily find similar behaviors in other animals. Chimpanzees use twigs to eat ants, and use branches as ladders to climb over walls. Even their thumbs, like ours, can grasp objects. Tool use is actually quite common in the animal kingdom. It seems tool use is not a uniquely human trait — those animals that also possess this skill haven\u0026rsquo;t developed higher civilizations.\nThe Cognitive Revolution?\nAfter reading Sapiens, there was one particularly novel idea. I long and firmly believed it was correct: the Cognitive Revolution. The author argues that the Cognitive Revolution was the crucial juncture where Homo sapiens diverged dramatically from other animals. The Cognitive Revolution occurred before the Agricultural Revolution, when sapiens were still just hunters. The author gives a classic example: one person discovers a lion by the river, and returns to tell the rest of the tribe — \u0026ldquo;There\u0026rsquo;s a lion by the river.\u0026rdquo; At that moment, even though no one else has seen it with their own eyes, they all believe in their minds the concept of \u0026ldquo;there\u0026rsquo;s a lion by the river.\u0026rdquo; This transmission of belief later gave rise to religion, power, nations, currency, corporations, and other virtual concepts. Are We Smart Enough to Know How Smart Animals Are? offers a counterexample: a monkey, being bullied by two others, cornered with no escape, lets out a \u0026ldquo;snake!\u0026rdquo; cry (the call they only make when they encounter snakes). The two other monkeys stop to check whether there really is a snake — only when they confirm there isn\u0026rsquo;t one do they resume the chase. Many observations show that numerous animals possess the ability to believe through others\u0026rsquo; stories.\nUpright walking?\nUpright walking freed our hands, and our brains grew increasingly developed. This is described in Sapiens. In fact, bipedal walking isn\u0026rsquo;t as special as we imagine. 
Bonobos on the savannah can walk on two legs for extended periods.\nLanguage?\nLanguage was once thought to belong to humans alone. Just because we can\u0026rsquo;t understand what animals are saying doesn\u0026rsquo;t mean they lack simple language. Animals\u0026rsquo; various calls are not innate. When a chimpanzee grows up with one group, their calls in different situations are similar. If you place that chimpanzee in a different, unrelated chimpanzee group, researchers found their calls are completely different — and for a long time, that chimpanzee cannot integrate into the new group until it learns the new calling patterns. Some once believed language influences how we think. But to think, language is not a necessity. The ability of animals to add different numbers was once thought to depend on language, yet in an experiment, a chimpanzee successfully added numbers.\nCooperation?\nThe Wandering Earth 2 has this scene: a minister shows a fossilized human bone that was broken and healed — proof that this human suffered a severe injury. Among other animals, the injured would be abandoned, but this person received help from others and survived. Is cooperation the dividing line between humans and animals? Chimpanzee groups help elderly chimpanzees with limited mobility — bringing them food, feeding them water mouth-to-mouth.\nComplex social relationships?\nChimpanzees not only know their own relationship with other chimpanzees, but also understand the relationships between B and C. Even when encountering an unfamiliar chimpanzee, they can assess its social status through how other chimpanzees treat it, and behave accordingly.\nThinking about the future?\nAbsolutely no problem at all\u0026hellip;\nPlato proposed that humans are the only featherless bipeds. 
Diogenes then plucked a chicken and said: \u0026ldquo;Behold — Plato\u0026rsquo;s \u0026lsquo;man.\u0026rsquo;\u0026rdquo; We can keep adding qualifiers to this definition until we can no longer find a description that fits only humans and no other animal. Humans and animals are certainly different — of course we can find the most fitting description of humans from many perspectives. But isn\u0026rsquo;t that a bit too subjective?\nAlthough this book refutes various claims of difference, the author does not deny that humans are special. In some respects, we are clearly unique. But we have yet to find that distinguishing point — at least, no consensus has been reached. If we want to find the essential difference between humans and animals, we must first discard the presupposition that \u0026ldquo;humans are special.\u0026rdquo;\nClosing # First, a complaint about the Chinese translation — it screams machine translation. For example: \u0026ldquo;人们认为动物善于学习行为的普遍后果，但无法记住任何特定的联系\u0026rdquo; (\u0026ldquo;People believe animals are good at learning the general consequences of behavior, but cannot remember any specific connections\u0026rdquo;). It\u0026rsquo;s very hard to understand this sentence using direct Chinese thinking — it reads exactly like a machine-translated sentence. But if you think in English, it\u0026rsquo;s instantly clear: the sentence means \u0026ldquo;People believe animals are good at learning the consequences of behaviors but do not know the connection between the behavior and the consequence\u0026rdquo; (the author is refuting this statement).\nAre We Smart Enough to Know How Smart Animals Are? has a strong academic atmosphere. It uses a wealth of reliable experiments and observations to explain the essence of animal behavior. Reading this book feels a bit like reading a paper — logically rigorous and cautiously worded in its claims. 
Primate studies, as the frontier of animal behavior research, hold great significance for studying human behavior — though some other animals\u0026rsquo; behaviors are also useful.\nThe book contains many ideas that spark sudden flashes of insight: Clever Hans, the impossibility of equal testing environments for human infants and apes, the homology of all vertebrate brains, chimpanzees\u0026rsquo; astonishing memory and logical reasoning abilities, chimpanzee power struggles, and more. Frans de Waal\u0026rsquo;s other book Chimpanzee Politics has already been added to my reading list\u0026hellip;\n","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/2024/08/12/book-notes-are-we-smart-enough-to-know-how-smart-animals-are/","section":"Posts","summary":" Preface # My previous book was Wild — the Pacific Crest Trail queen mentioned this book, Are We Smart Enough to Know How Smart Animals Are?, noting how she’d read it page by page, tearing each one out after reading. I wonder: as she journeyed through mountains and forests, hearing birdsong and streams, did reading this book about how clever animals are feel especially resonant?\nI’d previously read Sapiens (I can’t help recommending this book — it’s incredible). That book starts from when humans first stood upright and traces our journey until we gradually became gods… What exactly makes humans different — what allows us to stand out from the myriad of living creatures?\n","title":"Book Notes — Are We Smart Enough to Know How Smart Animals Are?","type":"posts"},{"content":" Preface # Frans de Waal\u0026rsquo;s seminal work Chimpanzee Politics was published in 1982 — his first book and also recommended reading for incoming members of the U.S. Congress. Another work of his I read previously, Are We Smart Enough to Know How Smart Animals Are?, was from 2016 — such a vast timespan between them. 
Are We Smart Enough introduced many animal behaviors, including those of humanity\u0026rsquo;s numerous close relatives, while Chimpanzee Politics focuses solely on our very closest relative — the chimpanzee. It observes a chimpanzee colony in a zoo and analyzes the structure, evolution, and behaviors of chimpanzee social power and politics.\nIf you see chimpanzees at the zoo mating brazenly in broad daylight without any inhibitions, or screaming and attacking one another — seemingly devoid of moral restraint, showing no trace of civilization — then the English title of Are We Smart Enough serves as a perfect retort: \u0026ldquo;Are We Smart Enough to Know How Smart Animals Are?\u0026rdquo;\nPower and Alliances # It\u0026rsquo;s commonly assumed that in animal social structures, the strongest male becomes the leader. This does broadly align with chimpanzee social structure. But it\u0026rsquo;s far from that simple — physical strength is not the sole factor determining dominance relationships. Alliances are the crucial factor, perhaps the most important factor. The book spends extensive passages discussing \u0026ldquo;triangular relationships.\u0026rdquo; Here, I need to introduce the book\u0026rsquo;s three main chimpanzee protagonists:\nYeroen (the elder) — Luit (the middle) — Nikkie (the young)\nThese three male chimpanzees form a power center — the power core of this chimpanzee colony — and their political struggles play out on this political stage. All three have, at different times, been the colony\u0026rsquo;s alpha. Initially, the capable and broadly respected Yeroen was alpha. Then Luit took over. Finally, Nikkie established a puppet-style rule. They built a hierarchical organization and competed within it for dominance over the rest of the group.\nFirst: a male with superior fighting ability cannot simply usurp the group\u0026rsquo;s leadership. 
Power collapses not when a challenger defeats the current ruler in combat, but when the ruler can no longer protect other members of the society. During Luit\u0026rsquo;s bid for power, Luit and his ally Nikkie constantly attacked other group members, and when the Luit-Nikkie alliance was present together, Yeroen could not offer protection to others.\nThe Luit-Nikkie alliance played a decisive role in toppling the Yeroen dynasty. But Yeroen\u0026rsquo;s fall from power also created new alliance opportunities — just like human politicians, chimpanzees seize such opportunities too. Yeroen found the key player in the current \u0026ldquo;triangular relationship\u0026rdquo;: Nikkie.\nBefore Yeroen\u0026rsquo;s fall, Nikkie was Luit\u0026rsquo;s ally. Afterward, Nikkie became Yeroen\u0026rsquo;s ally. Why would the seasoned Yeroen support Nikkie after losing power?\nFor Nikkie: he went from number two to number one. He was the \u0026ldquo;person\u0026rdquo; most eager for Yeroen\u0026rsquo;s support. For Yeroen: an alliance with Nikkie secured his position as number two in the group, and Nikkie — relative to Yeroen — needed his support more. Nikkie couldn\u0026rsquo;t openly oppose Yeroen, because if he did, Nikkie\u0026rsquo;s own position would become unstable. Yeroen gained more freedom of action and traded it for more mating opportunities with females. As for Luit: he dropped from the top of the power rankings to number three. The Yeroen-Nikkie alliance, though tight, featured a very cunning Yeroen. Although Yeroen\u0026rsquo;s relationship with Luit was terrible, Yeroen would still proactively approach Luit — and Nikkie would invariably intervene, without exception. Why did Yeroen seek contact with Luit? Yeroen approached Luit precisely to put on a show for Nikkie. For Nikkie, Yeroen\u0026rsquo;s behavior served as a constant reminder that Nikkie\u0026rsquo;s position depended entirely on Yeroen\u0026rsquo;s choices. 
The young Nikkie lacked strong grassroots support from the group. The seasoned, cunning Yeroen held Nikkie in the palm of his hand — Nikkie\u0026rsquo;s ruling foundation did not rest under his own feet.\nWhen one chimpanzee grooms another\u0026rsquo;s fur, this is not merely a simple biological act — it\u0026rsquo;s a reflection of the two chimpanzees\u0026rsquo; social relationship, signifying that their bond is sufficiently strong, or that one seeks a favor from the other. A classic scenario in the triangular relationship: Nikkie (center) grooms his ally Yeroen (left), while Luit (right) sits alone at a short distance.\nMales and Females in Power # Although males are generally stronger than females, male chimpanzees do not use their full strength when attacking females. Males only bite and tear at each other when facing another male.\nSocial mammal groups are typically composed of many females and a few males. Females also play an important role in power struggles.\nFemale chimpanzees tend to avoid competition because they need a safer, more stable environment to raise offspring. Power transitions in the group do not happen instantaneously — when Luit replaced Yeroen, the process took over two months. During those two months, the two chimpanzees repeatedly fought and reconciled. Female chimpanzees played a vital mediating role in this process. Females would proactively embrace both of them, breaking the tension during confrontations and working hard to push them toward reconciliation.\nMale leadership arises from strength, alliances, and support levels. Females also have a leader, but female leadership is determined by character and age. 
Females almost never need to fight each other; the probability of conflict between females is extremely low, and their hierarchical order can persist for many years.\nSocial psychologists, through alliance-game testing, have found that males take more proactive action, while females place more emphasis on the atmosphere of the game. In competitive activities, men are all about achieving strategic objectives — they prefer to seize the \u0026ldquo;big\u0026rdquo; events. Women are more interested in individual connections, forming alliances with those they like, and they focus on the immediate rather than distant political goals. Of course, these are statistical tendencies — exceptions always exist.\nPower and Sex # Avoiding incest is a moral or legal constraint in human society, often considered part of human culture. If mating were purposeless, would group-living chimpanzees have incest problems? In reality, such problems are extremely rare. Chimpanzees actively avoid incest. Mothers know who their sons are, and when a son reaches adulthood, chimpanzee mothers absolutely will not tolerate incestuous behavior. Young chimpanzees may not know who their fathers are, but they strongly resist mating with males roughly their father\u0026rsquo;s age. Biologists believe incest avoidance is a natural law deeply embedded in culture.\nPower and sex are certainly linked. Chimpanzee alphas typically enjoy extremely high mating privileges — until overthrown by a rebel. But these mating privileges occur during ordinary times; female chimpanzees will secretly mate with males they treated coldly during the day — at night, or in places the alpha can\u0026rsquo;t see, like in the tall grass. How similar this is to human society needs no elaboration.\nJealousy produces more offspring. Chimpanzee social structure includes multiple females and males. 
More jealous males will do everything to prevent other males from contacting females, giving themselves more opportunities to sire offspring — and those offspring, in turn, will also be more jealous. Females, however, are entirely different: no matter whom she mates with, her number of offspring is fixed, and the offspring are always hers. So jealousy among females is not pronounced. But in pair-bonding species, things look completely different — in pair-bonding species, females also engage in sexual competition. In such cases, females are more inclined to maintain long-term relationships with males. In modern human society, men care more about whether their female partner has had sex with another man; women care more about whether their partner has fallen in love with another woman. At its essence, even the cornerstone of human society — the family — is merely a unit of sex and reproduction.\nClosing # There\u0026rsquo;s actually a lot more interesting material I haven\u0026rsquo;t gotten to — too lazy to expand further. Some perspectives I personally really like:\n\u0026ldquo;Humans are engaged in continuous office competition while simultaneously uniting against a common enemy.\u0026rdquo; \u0026ldquo;Hierarchical order is a cohesive factor that imposes limits on competition and conflict.\u0026rdquo; \u0026ldquo;The roots of politics are far older than humanity.\u0026rdquo; Universal Safety Disclaimer # Large portions of this article are drawn from the book Chimpanzee Politics and do not represent my personal views.\n","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/2024/08/12/book-notes-chimpanzee-politics-power-and-sex-among-apes/","section":"Posts","summary":"Preface # Frans de Waal’s seminal work Chimpanzee Politics was published in 1982 — his first book and also recommended reading for incoming members of the U.S. Congress. 
Another work of his I read previously, Are We Smart Enough to Know How Smart Animals Are?, was from 2016 — such a vast timespan between them. Are We Smart Enough introduced many animal behaviors, including those of humanity’s numerous close relatives, while Chimpanzee Politics focuses solely on our very closest relative — the chimpanzee. It observes a chimpanzee colony in a zoo and analyzes the structure, evolution, and behaviors of chimpanzee social power and politics.\n","title":"Book Notes — Chimpanzee Politics: Power and Sex among Apes","type":"posts"},{"content":" Preface # Homo Deus: A Brief History of Tomorrow is the second book in the trilogy by Israeli historian Yuval Noah Harari. The trilogy consists of Sapiens: A Brief History of Humankind, Homo Deus: A Brief History of Tomorrow, and 21 Lessons for the 21st Century. The most famous, of course, is Sapiens — an extraordinarily sweeping book about the history of human civilization that can absolutely reshape your view of history. Last year (2022), I stubbornly gnawed through the English original of Sapiens page by page — quite an achievement. Because I loved Sapiens so much, Homo Deus, the sequel from this giant of a thinker, naturally became this year\u0026rsquo;s most important \u0026ldquo;extracurricular reading.\u0026rdquo;\nSapiens tells the story of human history — from Homo sapiens standing upright to launching rockets to explore the stars: how did we get here? Homo Deus discusses the critical issues currently facing human civilization, and where we are headed.\nThis copy of Homo Deus was hard to come by. In the end, I bought a second-hand Chinese edition from JD — it came from the library of Xingtan Liang Qiuju Middle School \u0026#x1f604;. When I opened it to the first page, I found that a cheeky middle schooler had left a line of English. 
Let\u0026rsquo;s start with that:\nWhen facing the ultimate questions of this chaotic world, we need Chinese readers to contribute their wisdom.\nThe New Agenda # Famine # Open almost any history book, and you\u0026rsquo;ll read about the horrors of famine and the insane behavior of people pushed to starvation. There\u0026rsquo;s no need to bring up famines in other countries — the most noteworthy case is right here in China. From the earliest written records all the way to the 20th century, China suffered the ravages of famine for thousands of years. We\u0026rsquo;ve always been an agricultural nation; nearly everyone had to work the land to feed themselves and their families. If crops failed — due to natural disasters (too much or too little rain, locust plagues, etc.) or human interference (bandits, oppressive taxes, irregular planting) — some people would face food shortages. Most modern people have no idea what it feels like to go without food for days on end. I\u0026rsquo;ve been hungry for stretches myself, and I know that prolonged hunger is a misery the average person can\u0026rsquo;t imagine — but even I was never at risk of starving to death. Yet our ancestors, facing the prospect of actually starving to death, what kind of despair must they have felt? They had no solution but to pray to the gods for favorable weather and a bountiful harvest the following year.\nThere\u0026rsquo;s a line from House of Cards that really stuck with me: \u0026ldquo;Twenty years ago, I couldn\u0026rsquo;t buy sugar in China. Now I can buy it anywhere.\u0026rdquo; Crude as it sounds, it reflects a reality: the Chinese people have escaped poverty. For the first time in Chinese history, we are no longer tormented by famine. We created this economic miracle — something worth recording! Similarly, human civilization as a whole has recently solved the problem of hunger. 
Food shortages in particular regions are almost always caused by political factors, and internationally, there are ample surplus resources for emergency response to shortages. Food scarcity is no longer a human agenda item.\nOn the contrary, humanity is no longer concerned with food shortages but is starting to worry about food surpluses. Health problems caused by obesity and malnutrition far outnumber those caused by starvation. Many people mindlessly chew through bread, rice, and loads of carbohydrates without getting enough protein and vitamins. The rich eat lettuce salads; poor Westerners eat cake, burgers, and pizza; and I eat fried dough sticks, steamed buns, rice, and noodles — my weight keeps climbing every day, and my health problems multiply year by year\u0026hellip;\nBacteria and Viruses # The Black Death: In the 1330s, the Black Death — the bacterium Yersinia pestis — caused 70 to 200 million deaths worldwide, with a mortality rate of roughly 50%. The Spanish Flu: 1918. Infected 500 million people; 50 to 100 million died. Mortality rate around 15%. Smallpox: 1967 — 15 million infected, 2 million deaths. Mortality rate about 15%. Following global smallpox vaccination, the smallpox virus was eradicated by humanity in 1979. AIDS: Broke out in the 1980s. Over 30 million deaths. Destroys the immune system. Current medications are effective but cannot provide a perfect cure. Infection rate: 0.9%. Mortality rate: 1.28 per 100,000. SARS: 2003. 8,000 infected, over 700 deaths. Avian Flu: Fewer than 1,000 deaths. H1N1 Swine Flu: 2009. 700 million to 1.4 billion infected. Approximately 150,000 to 600,000 deaths. Infection rate: 20%. Mortality rate: ~0.02%. Ebola: Multiple outbreaks in Africa. Mortality rate above 50%. Of the above, only the Black Death is bacterial; all the rest are viral. 
The Black Death struck long before modern medicine: although it was bacterial, medical care at the time was so primitive that people had no idea what was happening, leading to massive casualties and an extraordinarily high fatality rate. Smallpox is humanity\u0026rsquo;s greatest success story in the war against viruses — through modern medicine and vaccines, we outright eliminated the smallpox virus. As you can see, humanity has developed a silver bullet for bacteria — antibiotics. Bacterial epidemics are essentially gone. Viral epidemics, however, keep emerging in an endless cycle: as one subsides, another rises. There\u0026rsquo;s no great solution; modern medicine still has room to improve against viral epidemics. Major viral pandemics still strike every few years, and seasonal flu never stops accompanying us.\nMost of these influenzas are weathered by the human immune system alone — modern medicine only plays a supporting role (basically, bringing down fevers). Especially for human infants: aside from getting all manner of vaccines right after birth, every other \u0026ldquo;cold\u0026rdquo; has to be toughed out by their own immune systems, with very few effective medications available. Kindergarten is less a place of learning and more a trial ground for human influenza and immune resistance.\nCOVID-19: Homo Deus was published in 2015, before COVID-19 happened. The author\u0026rsquo;s view on epidemics was: \u0026ldquo;Doctors can quickly get up to speed and rapidly discover treatments — humanity has probably already conquered epidemics.\u0026rdquo; I wonder what Yuval Noah Harari makes of COVID-19. Regarding COVID-19, there\u0026rsquo;s simply too much I want to say. So many grievances that they defy coherent complaint. 
In a single sentence: \u0026ldquo;On the matter of COVID-19, humanity was utterly shattered and exposed in all its ugliness.\u0026rdquo;\nWar # Skipped.\nHumanism # The concept of humanism emerged during the Renaissance, championing human rights and individual value in opposition to the religious theocracy of the time. It reached China roughly in the late Qing dynasty. Humanism has had a profound impact on modern society: people increasingly emphasize the concepts of the individual or the collective, rather than top-down religious dogma and the divine right of kings.\nHumanism advocates human rights against divine right, and individual freedom against personal dependency. What humanism worships is human nature — the human being itself.\nPeople are always contemplating the meaning of life. Humanism holds that humanity itself is the source of meaning: \u0026ldquo;I am the meaning.\u0026rdquo; It also holds that free will is the highest authority. Humanism proposes a new life principle: \u0026ldquo;If I feel it\u0026rsquo;s good, it\u0026rsquo;s good; if I feel it\u0026rsquo;s bad, it\u0026rsquo;s bad.\u0026rdquo; For example: if a woman has an affair, in pre-humanist society, she would face punishment from religion and social norms — the censure of priests and elders. In modern society, she need only heed her own true feelings; the best approach is to ask her own heart what it thinks.\nFor society as a whole: what everyone believes is good is \u0026ldquo;good\u0026rdquo;; what everyone believes is bad is \u0026ldquo;bad.\u0026rdquo; Take theft, for example. For the victim, it\u0026rsquo;s certainly bad. For everyone else, it\u0026rsquo;s also bad — because others don\u0026rsquo;t want to be stolen from either, including thieves themselves. Thus, theft is bad, and people can even write it into a mutually binding document. By the same logic, if a certain behavior feels bad to no one at all, then it\u0026rsquo;s not wrong. 
This naturally leads to the question of homosexuality: two people of the same sex feel that this is good, and it affects no one else — therefore, it\u0026rsquo;s not wrong. So humanism supports homosexuality and opposes religion.\nHumanism can perfectly address these two types of extreme questions. But for events that are good for some and bad for others — like the trolley problem — it\u0026rsquo;s much harder to answer. In ancient societies, Confucianism advocated that women remain faithful to one husband unto death, even erecting chastity archways. In modern society, people who can find happiness elsewhere don\u0026rsquo;t want to stay bound in misery. But what if divorce leads to happiness for one side and utter misery for the other? Add the emotional harm to the children, and the whole situation becomes very hard to measure: whose happiness matters more? Humanism will only tell you: \u0026ldquo;Follow your own heart.\u0026rdquo;\nAs humanism gained broader acceptance, it evolved into three major branches:\nLiberal Humanism: The \u0026ldquo;orthodox\u0026rdquo; branch, also known as liberalism. The individual enjoys freedom; individual choice is respected. If it feels right to each person, it\u0026rsquo;s right. The classic example is liberalism\u0026rsquo;s belief that the ballot box represents individual will. But this requires one precondition: before voting, everyone must be \u0026ldquo;one of us.\u0026rdquo; For instance, the American North and South in 1861, or Israel and Palestine today — neither could possibly resolve their issues by having everyone vote together.\nSocialist Humanism: Socialist humanism doesn\u0026rsquo;t focus on individual feelings, viewing them as a bourgeois trap. What \u0026ldquo;I\u0026rdquo; feel in the present moment is merely a reflection of my environment, determined by my class. Liberalism believes voters can make the best choice; socialist humanism believes the organization can make the best choice. 
The individual must obey the organization\u0026rsquo;s decisions, not personal feelings.\nEvolutionary Humanism: Evolutionary humanism derives from Darwin\u0026rsquo;s theory of evolution. It holds that conflict is a form of evolution — eliminating the weak, survival of the fittest. Superior people deserve to survive; this is the law of human evolution. Evolutionary humanism was once all the rage, giving rise to many ideas such as eugenics, racism, and fascism.\nFrom 1914 to 1989, the three humanisms waged a war of faith. Liberalism and socialism joined forces to defeat Nazism in World War II. Then liberal nations and the Soviet Union each rallied allies into the Cold War. In the early Cold War, socialism consistently held the upper hand (the documentary The Vietnam War is highly recommended here) — students at UC Berkeley even kept Chairman Mao\u0026rsquo;s Little Red Book by their bedsides. Then, everything changed. The Soviet Union collapsed. Many countries shifted their beliefs; we too introduced market capitalism. People preferred supermarkets (or Taobao) and money-making companies over a system that allocated food and clothing. Liberalism won a sweeping victory in this war of faith — they even evolved further, adopting ideas and institutions from their rivals to provide better education, healthcare, and social security than before. But liberalism\u0026rsquo;s core ideology remained unchanged.\nDataism # Dataism holds the following three views:\nOrganisms are algorithms. Intelligence can exist without consciousness. Highly intelligent algorithms know me better than I know myself. Organisms Are Algorithms # \u0026ldquo;Organisms are algorithms\u0026rdquo; — I couldn\u0026rsquo;t accept this notion when I first encountered it either. How could organisms be algorithms? Doesn\u0026rsquo;t human experience matter? 
Is human consciousness worthless?\nLooking at capitalism and Soviet-style communism from the perspective of data processing, they are no longer ideological opposites but rather different data algorithms. Capitalism employs a distributed algorithm; Soviet-style communism employs a centralized algorithm. Capitalism allows connections between consumers and producers, permits individuals to freely exchange information and make independent decisions — the pricing and output of goods are determined by the free market. Soviet-style communism, on the other hand, severed the link between producers and consumers: the government collected consumption data and issued production directives to producers. The government took all of the workers\u0026rsquo; productive surplus, then determined what each individual needed, then re-distributed accordingly. Tax rates work the same way — high tax rates essentially concentrate more resources together, with the government as a single processor deciding how resources are allocated and utilized.\nA single processor can\u0026rsquo;t possibly make the right decisions forever. No one person can handle such enormous amounts of data — even today\u0026rsquo;s high-speed computers can\u0026rsquo;t process it all.\nFrom the perspective of Dataism, capitalism won the Cold War because its distributed algorithm was better suited to that era than Soviet-style communism\u0026rsquo;s centralized algorithm: the better data algorithm prevailed. When we chose to embrace the market economy and abandon Soviet-style communism, it was equivalent to decentralizing processing power to every individual, no longer using the single-processor model. That\u0026rsquo;s why Socialism with Chinese Characteristics survived the Cold War, while the Soviet single-processor data model failed utterly. Currently, only a very few authoritarian states still use this single-processor model — and after all these years, we\u0026rsquo;ve seen no productivity advances from them. 
This is also a real-world reflection of \u0026ldquo;organisms are algorithms.\u0026rdquo;\nI\u0026rsquo;m merely using the Dataist lens to view economic models here — no intention of judging which model is better or worse. Beyond fitting economic models so neatly, Dataism can also be applied to view problems in many other domains.\nIntelligence Can Exist Without Consciousness # First, we need to be clear: what is consciousness? Someone might say, \u0026ldquo;Consciousness is the self,\u0026rdquo; or \u0026ldquo;Consciousness is the voice inside.\u0026rdquo; These don\u0026rsquo;t answer the question scientifically. Note: science deals with objective facts; subjective matters fall outside science\u0026rsquo;s domain — they belong to theology. We cannot explain the subjective using the subjective. In truth, humanity still hasn\u0026rsquo;t figured out what consciousness is.\nIf every person is an algorithm, then there\u0026rsquo;s really no concept of \u0026ldquo;autonomous consciousness.\u0026rdquo; We can regard what we hear, smell, and see as \u0026ldquo;input data.\u0026rdquo; After computation by our biological organism, a response is produced and an action taken — that\u0026rsquo;s the \u0026ldquo;output data.\u0026rdquo; The human body itself is more like a CPU — perhaps one that can self-regulate, but even the regulation itself requires data input, like learning knowledge or exercising. So what role does \u0026ldquo;self-consciousness\u0026rdquo; play in this process? I can clearly make choices about something — if I choose differently, a different outcome results. I must be consciously aware\u0026hellip; right? This question may not be so easy to answer. 
If, hypothetically, there were no subjective consciousness — not brain death, but \u0026ldquo;I can\u0026rsquo;t feel my self\u0026rdquo; — would \u0026ldquo;I\u0026rdquo; still make different choices?\nFrom a biological perspective, consciousness is nothing more than countless electrical currents in the brain\u0026rsquo;s neural network. When \u0026ldquo;I\u0026rdquo; make a different choice, it may simply be that some nerve ending fired an extra tiny electrical pulse. \u0026ldquo;Self-consciousness\u0026rdquo; played no role whatsoever in this process. Without \u0026ldquo;me,\u0026rdquo; it seems my body could still make different choices, as long as the \u0026ldquo;algorithm\u0026rdquo; stored in my body still exists. If \u0026ldquo;self-consciousness\u0026rdquo; exists, it\u0026rsquo;s more like a belief rather than an objective fact — like believing in God. The most cutting-edge biological science suggests that consciousness is merely a byproduct of an individual organism\u0026rsquo;s algorithms — it could even be viewed as a kind of mental pollution.\nThen we arrive at another question: is artificial intelligence (AI) conscious? If it\u0026rsquo;s not conscious, can we treat it as an intelligent being? The best method humans currently have for testing whether AI has consciousness is the Turing Test. The Turing Test\u0026rsquo;s logic is simple: as long as a normal human can\u0026rsquo;t tell whether the AI is human or not, it passes. In other words, once AI becomes smart enough, we humans have no choice but to consider it \u0026ldquo;conscious.\u0026rdquo;\nAlgorithms Know Me Better Than I Know Myself # Humanism calls on us to listen to our inner authentic voice. But if the self doesn\u0026rsquo;t even exist, what is there to listen to? Dataism calls on us to \u0026ldquo;listen to the algorithm\u0026rsquo;s advice\u0026rdquo; — the algorithm knows me better than I know myself. 
For example: when a woman is on a blind date and meets two men who both seem suitable, without algorithmic assistance, she would follow her inner voice and choose the one who \u0026ldquo;feels\u0026rdquo; more right. Now imagine an algorithm tells her: \u0026ldquo;I know you very well. I know you\u0026rsquo;re attracted to Man A; you\u0026rsquo;ll choose him. But he will ultimately break your heart and leave you. Man B is the one for you — and if you choose B, you\u0026rsquo;ll fall in love just as quickly. He will give you lasting happiness. This is a choice you won\u0026rsquo;t regret.\u0026rdquo; From any angle, shouldn\u0026rsquo;t she listen to the algorithm\u0026rsquo;s advice rather than that fleeting feeling of the moment?\nIn its early R\u0026amp;D phase, algorithms are built by engineers continuously piling up code. At this stage, people still have a decent grasp of what the algorithm is \u0026ldquo;thinking.\u0026rdquo; But algorithms can self-learn and self-update. Their learning capacity is utterly beyond human comparison. They will gradually carve out their own path, until humans can no longer keep up.\nClosing Thoughts # Homo Deus is, as ever, packed with substance — novel, robust ideas, all-encompassing. A highly recommended work. While reading, I often paused to reflect: does what he\u0026rsquo;s saying match reality? Is it correct? Many times, I felt shocked. I used to never underline when reading books, but I did so with this one. When I finished, I found the book covered in my highlights.\nThis monumental work is so dense with content that this article can\u0026rsquo;t possibly cover everything. This piece is relatively one-sided — I\u0026rsquo;ve mostly only discussed productivity-related viewpoints. 
There\u0026rsquo;s actually a great deal of other fascinating material, such as the book\u0026rsquo;s perspective on \u0026ldquo;happiness\u0026rdquo;: \u0026ldquo;Would you rather be an unhappy but wealthy Singaporean, or a happy but poor Costa Rican?\u0026rdquo; I don\u0026rsquo;t know how I\u0026rsquo;d answer. But if the author rephrased the question to me as: \u0026ldquo;Would you rather eat more hot pot, or eat vegetables and whole grains every day, maintaining a nutritionally balanced, healthy body?\u0026rdquo; — then I\u0026rsquo;d definitely answer: hot pot.\n","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/2024/08/12/book-notes-homo-deus-a-brief-history-of-tomorrow/","section":"Posts","summary":"Preface # Homo Deus: A Brief History of Tomorrow is one of the trilogy by Israeli historian Yuval Noah Harari. The trilogy consists of Sapiens: A Brief History of Humankind, Homo Deus: A Brief History of Tomorrow, and 21 Lessons for the 21st Century. The most famous, of course, is Sapiens — an extraordinarily sweeping book about the history of human civilization that can absolutely reshape your view of history. Last year (2022), I stubbornly gnawed through the English original of Sapiens page by page — quite an achievement. Because I loved Sapiens so much, Homo Deus, the sequel from this giant of a thinker, naturally became this year’s most important “extracurricular reading.”\n","title":"Book Notes — Homo Deus: A Brief History of Tomorrow","type":"posts"},{"content":" Preface # Mention Romance of the Three Kingdoms, and it seems almost everyone can name a few characters or plot points. But have you actually read the original?\nI\u0026rsquo;ve always been a fan of Three Kingdoms-themed games — titles like Bàwáng Dàlù (The Overlord\u0026rsquo;s Continent) and Total War: Three Kingdoms are among my favorites. I love the feeling of collecting famous generals and rampaging across the battlefield. 
But thinking back, I realized I\u0026rsquo;d never actually read Romance of the Three Kingdoms in its entirety. Some of those generic officers in Total War — I had no idea who they were. And when I thought about it, I couldn\u0026rsquo;t come up with a single novel that could stand toe-to-toe with Romance, so I decided to give the original a try. Once I started, I couldn\u0026rsquo;t stop\u0026hellip;\nRomance of the Three Kingdoms is written in the vernacular Chinese of the ancient period, which differs somewhat from modern vernacular Mandarin. For example, \u0026ldquo;暗赍金帛，结交中涓封谞\u0026rdquo; means secretly bringing gold and silk to befriend the eunuch Feng Xu (赍, pronounced jī, means \u0026ldquo;to bring\u0026rdquo; — a common exam term; 中涓 was a close-attendant official title, later used to refer to eunuchs in general). At first, it was admittedly hard going, but after a while it became quite smooth. When I didn\u0026rsquo;t understand something, I\u0026rsquo;d just check the annotations or underline it (once again, thank you, e-books). Also, a reading tip: skip the preface. I recommend mentally filtering out keywords like \u0026ldquo;peasant uprising\u0026rdquo;, \u0026ldquo;dialectical\u0026rdquo;, \u0026ldquo;feudal\u0026rdquo;\u0026hellip;\nMany people confuse Romance of the Three Kingdoms with Records of the Three Kingdoms. Let me emphasize this: Romance of the Three Kingdoms is a novel; Records of the Three Kingdoms is an official history. Some might counter, \u0026ldquo;But the Records was privately compiled,\u0026rdquo; or \u0026ldquo;There is no single truth in history.\u0026rdquo; You can believe there\u0026rsquo;s no absolute truth in history, but if you carry that attitude into historical scholarship, then there\u0026rsquo;s no point studying history at all. Not every statement in an official history is precise — some contain ambiguous or even contradictory accounts — but that only affects its reference value, not its status as an official history. 
The Records of the Three Kingdoms is one of the Twenty-Four Histories, an undisputed official history, beyond all doubt. Romance of the Three Kingdoms is a novel written with deep reference to the Records, upon which artistic embellishments were layered.\nUnless I explicitly mention the Records, this piece discusses the novel alone. Although I dipped into the Records (and discovered that Total War draws primarily from the Records 👍), I found it too hardcore and decided to give up. In any case, the novel\u0026rsquo;s characters and plotlines all involve artistic license and differ from history — readers, please keep the distinction in mind.\nRed Cliffs # There are many plotlines in Three Kingdoms worth discussing, but given space constraints (or, honestly, I just don\u0026rsquo;t feel like writing more), I\u0026rsquo;ll focus on Red Cliffs.\nThe Battle of Red Cliffs is undoubtedly the crown jewel of the novel. All the great names make their entrance, stratagems fly thick and fast, and the intellectual duels between Zhou Yu and Kongming (Zhuge Liang) elevate the battle to the pinnacle of wit. I\u0026rsquo;ve always been fond of Zhou Yu — brimming with talent, dashing and heroic, brave and resourceful, commanding armies with brilliance, achieving greatness young (a winner in life), with the ability of a king\u0026rsquo;s right-hand minister. But to highlight Zhuge Liang\u0026rsquo;s genius, the novel deliberately places Zhou Yu\u0026rsquo;s talents a notch below Kongming at every turn, making him Red Cliffs\u0026rsquo; absolute foil to set off Zhuge Liang. As I watched the TV series and read the novel, I increasingly felt that the early-period Zhuge Liang was simply a \u0026ldquo;monster\u0026rdquo; — \u0026ldquo;utterly inhuman.\u0026rdquo;\nRed Cliffs features an all-star cast. 
All three of Liu Bei\u0026rsquo;s top strategists were involved: Zhuge Liang, Pang Tong (then serving Wu), and Xu Shu (then in Cao Cao\u0026rsquo;s camp) each played critical roles in the battle\u0026rsquo;s schemes. The warriors basically just cleaned up — only Liu Bei and Guan Yu visited Wu\u0026rsquo;s naval camp, and Zhao Yun once rescued the strategist. Wu itself was the protagonist, naturally — Zhou Yu, Lu Su, Huang Gai, Gan Ning, Kan Ze all had major parts, and the rest of the Wu officers were united in purpose, not a single one dragging their feet. On Cao Cao\u0026rsquo;s side: the Chancellor himself, advisers Cheng Yu and Xun You (Xun Yu and Jia Xu didn\u0026rsquo;t come; Guo Jia had died young), officers Mao Jie and Yu Jin, famous generals like Zhang Liao and Xu Chu essentially making cameo appearances, plus the tragic patsies Cai Mao, Zhang Yun, Cai Zhong, and Cai He, and the clown Jiang Gan\u0026hellip; I suspect many people have never read Red Cliffs carefully, or perhaps never read the original at all. I specifically drew a flowchart:\nTwo favorite passages:\n\u0026ldquo;A gust of wind blew, lifting a corner of the banner to brush across Zhou Yu\u0026rsquo;s face. 
Yu suddenly recalled something weighing on his heart, let out a great cry, fell backwards, and vomited blood.\u0026rdquo; This brief passage is gripping, vivid as a film scene, and underscores the importance of the southeast wind — not a single word wasted.\nYu said: \u0026ldquo;\u0026lsquo;Man\u0026rsquo;s fate shifts between morning and evening\u0026rsquo;; how can one guarantee one\u0026rsquo;s safety?\u0026rdquo; Kongming smiled and replied: \u0026ldquo;\u0026lsquo;The heavens hold storms none can foresee\u0026rsquo;; how can man predict them?\u0026rdquo; Yu turned pale upon hearing this and feigned moans of pain\u0026hellip; Kongming smiled: \u0026ldquo;I have a prescription that will settle the Commander\u0026rsquo;s distress.\u0026rdquo; The entire exchange never once mentions the east wind, yet Yu and Liang have already dueled several rounds over it. Truly brilliant~\nEmbellishments # Adding one\u0026rsquo;s own interpretations or plot elements on top of the original — I call these \u0026ldquo;embellishments.\u0026rdquo; The original plot is already extraordinarily compelling. Even where modern readers might find things hard to understand, if you immerse yourself in the mindset of ancient (Eastern Han!) people, there are virtually no logical gaps. This is one reason why Romance of the Three Kingdoms is held in such high regard. That\u0026rsquo;s why many people still prefer the old TV adaptation that respects the original (with minimal changes) over the new adaptation full of embellishments. Some embellishments — no one even knows who started them — conspiracy theorists abound, and many fabricated plotlines have become widely accepted as fact, which is truly a shame.\nKongming letting Cao Cao escape. At Huarong Trail, where Guan Yu spares Cao Cao out of a sense of honor, Kongming deliberately sent Guan Yu knowing Cao Cao would be released. This has spawned countless interpretations. 
But in the original, Cao Cao escapes simply because Kongming, observing the stars at night, concluded that Cao Cao was not fated to die that night. Don\u0026rsquo;t dismiss this as childish — the novel treats \u0026ldquo;star-reading\u0026rdquo; as a very real mystical phenomenon. \u0026ldquo;Read the stars and release Cao Cao\u0026rdquo; — this reason is entirely sufficient in the novel\u0026rsquo;s own terms. As for \u0026ldquo;they feared Wei\u0026rsquo;s retaliation so they let Cao Cao go\u0026rdquo; — pure later embellishment. Romance never once features a plot where someone refrains from killing out of fear of retaliation. The same goes for Guan Yu\u0026rsquo;s death.\nGuan Yu\u0026rsquo;s death. In the original, both Wei and Wu wanted Guan Yu dead — such was the era, and in the novel, Guan Yu was a godlike figure. You couldn\u0026rsquo;t take Jing Province without killing him; both sides went all out. It\u0026rsquo;s true that later, when Liu Bei raised a great army for revenge, both Wu and Wei tried to pass the blame — but that\u0026rsquo;s all after Guan Yu\u0026rsquo;s death. Also, many later commentators believe Guan Yu should have defended Jing Province rather than attacking. Here I must clear General Guan\u0026rsquo;s name: attacking Fancheng was Zhuge Liang\u0026rsquo;s order. Guan Yu simply failed to take it.\nPang Tong\u0026rsquo;s death. The original says Kongming, observing the stars at night, saw a general\u0026rsquo;s star falling and sent a letter warning Liu Bei to be cautious. But Pang Shiyuan (Pang Tong) suspected Kongming was just afraid of him stealing glory and urging Liu Bei to advance slowly, so he in turn pressed Liu Bei to speed up the campaign — and ultimately died at Fallen Phoenix Slope. 
The new Three Kingdoms TV series embellished this: because Liu Bei couldn\u0026rsquo;t bear to seize Yi Province, Pang Tong knowingly rode into the ambush at Fallen Phoenix Slope, sacrificing himself to give Liu Bei a pretext to break with Liu Zhang\u0026hellip; (this embellishment honestly disgusts me). Liu Bei and Liu Zhang\u0026rsquo;s conflict actually escalated gradually — Liu Zhang\u0026rsquo;s subordinates were already fighting Liu Bei, but the final break came when Liu Zhang discovered Zhang Song\u0026rsquo;s letter of surrender and realized Liu Bei\u0026rsquo;s wolfish treachery. While we\u0026rsquo;re here, let\u0026rsquo;s discuss a frequently debated detail: did Liu Bei give Pang Tong the Dílú horse? (The Dílú was said to bring misfortune to its rider; Xu Shu once advised Liu Bei to gift it to an enemy to avert the curse, then ride it himself — but immediately said he was merely testing Liu Bei\u0026rsquo;s character.) I\u0026rsquo;ve seen many comments assuming Liu Bei gave Pang Tong the Dílú, but reading the original carefully, it\u0026rsquo;s actually quite ambiguous. Liu Bei gave Pang Tong a white horse, but it\u0026rsquo;s never specified as the Dílú. In fact, after leaping across Tan Stream, the Dílú basically vanishes from the story. If it truly brought misfortune, Liu Bei had given it to Liu Biao (who returned it upon learning of the curse) — and Liu Biao died anyway. If it didn\u0026rsquo;t truly bring misfortune, that\u0026rsquo;s also plausible. Romance doesn\u0026rsquo;t treat all mystical elements as absolute truth: believing in them can be called respecting the spirits; disbelieving can be called being an extraordinary man or hero. 
Liu Bei and the Dílú lean more toward the latter, because Xu Shu was really just testing Liu Bei\u0026rsquo;s benevolence: \u0026ldquo;A man\u0026rsquo;s life and death are determined by fate — how could a horse be the cause?\u0026rdquo; If the horse truly brought misfortune, Xu Shu wouldn\u0026rsquo;t have said \u0026ldquo;I was testing you.\u0026rdquo; So personally, I believe what Liu Bei gave was not the Dílú, but simply one of his ordinary white horses. For a lord to gift his own horse was an immense honor in ancient times — this was simply meant to show Liu Bei\u0026rsquo;s genuine affection for Pang Tong. Later generations just preferred the Dílú storyline and embellished accordingly.\nDiaochan\u0026rsquo;s righteousness. Only very rarely do embellishments improve things. In the original, the eighteen lords\u0026rsquo; coalition was utterly helpless against the Western Liang army. After entering Luoyang, they all went their separate ways — while the Emperor remained in Dong Zhuo\u0026rsquo;s clutches\u0026hellip; And then, contrast this with Diaochan, a mere woman: solely to repay Minister Wang Yun for raising her (Diaochan was his adopted daughter), she offered her body and successfully drove a wedge between Dong Zhuo and Lü Bu. After this, the novel mentions Diaochan very little (she simply follows Lü Bu, with no further plot involvement). The Three Kingdoms TV adaptation\u0026rsquo;s treatment of Diaochan after her success is truly brilliant. The old Three Kingdoms series adds an epilogue for her: to a hauntingly beautiful melody, Diaochan retreats into obscurity after her great deed, never to be heard from again. The fate of a nation rested on a frail woman — starkly contrasting with the warlords\u0026rsquo; failure against Dong Zhuo and their secret scheming against each other. This segment is exquisite. Diaochan is a true hero! 
Compare this to the new Three Kingdoms\u0026rsquo; treatment of Diaochan: pure schlock, fabricating a romance between Lü Bu and Diaochan — utterly an embellishment, disrespecting the original and even disrespecting Diaochan.\nMysticism # It\u0026rsquo;s a novel, after all — many plot points are dramatized additions (the same goes for Water Margin and others). A bit of artistic license for reading pleasure is \u0026ldquo;the finishing touch on a dragon painting,\u0026rdquo; not \u0026ldquo;drawing legs on a snake.\u0026rdquo; Personally, I prefer to read Romance of the Three Kingdoms as a fantasy novel rather than a historical one.\nThe Yellow Turban Rebellion. The Yellow Turban Rebellion was less a peasant uprising than a religious war. At first, seeing Zhang Jiao cure people with talisman water, I assumed the author was portraying the Yellow Turbans as uncivilized charlatans. Then I discovered that Yu Ji also cured people with talisman water — and Yu Ji is clearly a positive character. Sun Ce, the \u0026ldquo;Little Conqueror,\u0026rdquo; disbelieved and ended up being mystically killed for it. So talisman-water healing is a real thing in the author\u0026rsquo;s universe. The three Zhang brothers genuinely possess supernatural abilities, and the Yellow Turban army is basically a religious sect. I eventually accepted the talisman-water premise.\n\u0026ldquo;His ears hung down to his shoulders, his hands reached past his knees, and his eyes could see his own ears.\u0026rdquo; Hands past the knees, fine — but eyes that can see your own ears? That\u0026rsquo;s not an eye problem, that\u0026rsquo;s an ear problem. The man was probably an elephant\u0026hellip;\nKilling one\u0026rsquo;s wife for food. While fleeing and seeking sustenance, Liu Bei encounters a hunter. Having found no game, the hunter kills his wife and serves her as food. 
Liu Bei only realizes the previous night\u0026rsquo;s meal was the man\u0026rsquo;s wife: \u0026ldquo;overcome with sorrow, he shed tears and mounted his horse.\u0026rdquo; When Cao Cao hears of the \u0026ldquo;kill-wife-for-food\u0026rdquo; incident, \u0026ldquo;Cao ordered Sun Qian to reward him with a hundred taels of gold.\u0026rdquo; Even someone like me, with fairly open views, was utterly shocked reading this. It\u0026rsquo;s astonishing how different ancient values were from ours, and lamentable how low women\u0026rsquo;s status was — mere objects\u0026hellip;\nStar-reading. In ancient times, star-reading was an official government post. In Romance, it\u0026rsquo;s a skill possessed by high-level strategists. Pang Tong lacks star-reading ability; Zhuge Liang and Sima Yi possess it.\nRǎng (禳). Actively attempting to alter fate. When Liu Bei\u0026rsquo;s Dílú horse threatened misfortune, the method Xu Shu described to dispel the calamity was called a rǎng ritual. Zhuge Liang used the qí-rǎng ritual to pray to the Northern Dipper, seeking to extend his life by one jì (twelve years).\nFlaws # Some important characters are described too sketchily. \u0026ldquo;By the time Song arrived, Zhang Jiao was already dead\u0026rdquo; — just a handful of words dismiss the death of the leader who ignited the earth-shaking Yellow Turban Rebellion. I always find this hard to accept; the author doesn\u0026rsquo;t even tell us how Zhang Jiao died. (If you Baidu it, you\u0026rsquo;ll just get middle-school history memorization paragraphs about why the Yellow Turban uprising failed\u0026hellip;)\nSome plotlines are repetitive. The famous \u0026ldquo;Borrowing Arrows with Straw Boats\u0026rdquo; actually appeared earlier. 
Sun Jian, while attacking Huang Zu, had a similar arrow-borrowing episode: \u0026ldquo;Jian plucked the arrows embedded in his boats, amounting to over a hundred thousand.\u0026rdquo; Red Cliffs and Yiling share similarities too — \u0026ldquo;southeast wind,\u0026rdquo; \u0026ldquo;boats loaded with thatch,\u0026rdquo; and \u0026ldquo;fire attack\u0026rdquo; are all keywords of Yiling as well. The Girdle Edict in the early chapters is a compelling storyline, and later there\u0026rsquo;s a parallel with Wei Emperor Cao Fang\u0026rsquo;s blood-written edict.\nAfter Zhuge Liang\u0026rsquo;s death, the later plot isn\u0026rsquo;t very engaging. By then, almost everyone I knew was dead. There\u0026rsquo;s Jiang Wei and Deng Ai to follow, perhaps, but the plotlines are formulaic and dull. The new characters are numerous but lack distinctive portrayals — you basically can\u0026rsquo;t remember them. Later battle scenes all follow the same template: feign defeat, lure the enemy deep, a cannon blast, then charge.\nCharacter Biographies # A sharp-tongued review of several characters, with brief introductions, summaries, and key deeds. Though it doesn\u0026rsquo;t quite align with the novel\u0026rsquo;s spirit, people always love debating martial prowess and intelligence scores. Having read the entire novel, I\u0026rsquo;ll try to discuss the numbers here too.\nFor one-on-one combat ratings, it\u0026rsquo;s not about who defeated whom — Romance features many draws, or fights broken off after twenty or thirty bouts for various reasons. I measure by number of bouts exchanged. In Romance, 100 bouts is generally the upper limit; fighters may rest and resume for another 100, as with Ma Chao and Xu Chu.\nWei Side # Cao Cao: Military strategist, statesman, man of letters. Extraordinarily fond of talent, shrewd himself, maxed out in both intelligence and ruling ability. The man who won the Central Plains battle royale. 
Welcoming Emperor Xian and establishing military farms (túntián) were both pivotal moves. There\u0026rsquo;s too much to say\u0026hellip; Everyone knows \u0026ldquo;a crafty hero in turbulent times,\u0026rdquo; but few mention \u0026ldquo;an able minister in peaceful times.\u0026rdquo;\nXun Yu: Cao Cao\u0026rsquo;s key early strategist, intellect no less than Guo Jia. Loyal to the Han dynasty to the end. Killed by Cao Cao.\nXun You: Wei\u0026rsquo;s strategist in the humiliating Red Cliffs campaign. Has intelligence but a notch below Guo Jia and Xun Yu.\nGuo Jia: Flawless. The number one grand-strategy adviser, relying on intellect rather than mysticism. Universally beloved. Died of illness while accompanying Cao Cao on the northern campaign. After Cao Cao\u0026rsquo;s defeat at Red Cliffs, he wept that Fengxiao (Guo Jia) was no longer with them — all others hung their heads in shame.\nCheng Yu: A strategist who appeared frequently in the early period. High intelligence — personally, I\u0026rsquo;d rate him roughly on par with Cao Cao: top-tier, but below Xun Yu and Guo Jia. At Red Cliffs, he saw through the southeast wind issue but was talked down by Cao Cao.\nJia Xu: Adviser to Li Jue, later joined Zhang Xiu, later surrendered to Cao Cao. An important mid-period Wei strategist.\nXu Chu: Wei\u0026rsquo;s top-tier solo combat god. Captured He Yi alive in one bout. Fought Ma Chao for 200 bouts — the bare-chested war god. During the Hanzhong campaign, drunk on grain-transport duty, he was slow to react and got stabbed in the shoulder by Zhang Fei. Limited appearances after that.\nDian Wei: Wei\u0026rsquo;s top-tier solo combat god. Master of twin halberds. Fought Xu Chu for two full shíchén (four hours). Felt like Cao Cao\u0026rsquo;s personal bodyguard. Killed during Zhang Xiu\u0026rsquo;s rebellion. Limited combat record.\nCao Ang: Cao Cao\u0026rsquo;s eldest son by Lady Liu. Killed during Zhang Xiu\u0026rsquo;s rebellion. 
Gave his horse to his father to ride, couldn\u0026rsquo;t escape himself. After the battle, Cao Cao wept only for Dian Wei, not for Cao Ang\u0026hellip;\nCao Pi: Cao Cao\u0026rsquo;s eldest son by Lady Bian. One of the Three Caos. Proclaimed himself emperor immediately after Cao Cao\u0026rsquo;s death. Defeated at Hefei.\nCao Zhang: Cao Cao\u0026rsquo;s second son by Lady Bian. Has combat achievements — defeated Liu Feng in three bouts. A pure warrior archetype. \u0026ldquo;A real man should emulate great generals like Wei Qing and Huo Qubing, leading a hundred thousand troops across the desert, driving out the barbarians, building a legacy of achievement — who would want to be a scholar?\u0026rdquo;\nCao Zhi: Cao Cao\u0026rsquo;s third son by Lady Bian. One of the Three Caos. \u0026ldquo;Vain and flashy, lacking sincerity, addicted to wine and unrestrained.\u0026rdquo;\nCao Xiong: Cao Cao\u0026rsquo;s fourth son by Lady Bian. Killed in the power struggle when Cao Pi succeeded to the throne.\nCao Chong: Not mentioned in the novel.\nCao Ren: A commanding general, no solo combat record, but a master of city defense. (His troops) shot Zhou Yu at Nanjun; (his troops) shot Guan Yu at Fancheng. Died during Cao Pi\u0026rsquo;s reign.\nCao Hong: Often appears leading troops. Fought He Man for fifty bouts and killed him in single combat. Personally killed Yuan Tan. Rescued Cao Cao at a critical moment. Fought Ma Chao for fifty bouts — his blade technique grew disordered, his strength failing. With Cao Xiu, forced the Han Emperor to abdicate. No further appearances.\nXiahou Dun: Fierce and bold. Took an arrow to the eye and swallowed his own eyeball. Fought Gao Shun for fifty bouts — victory. During \u0026ldquo;Crossing Five Passes and Slaying Six Generals,\u0026rdquo; he challenged Guan Yu to a duel, interrupted by Zhang Liao. The Wei protagonist at Bowang Slope. 
Died of illness during Cao Pi\u0026rsquo;s reign.\nXiahou Yuan: Master of long-distance rapid strikes (couldn\u0026rsquo;t find the original quote). Many appearances leading troops — a general-type commander. Later killed by Huang Zhong at Mount Dingjun.\nZhang Liao: Leader of the Five Elite Generals. Formerly under Lü Bu; close friends with Guan Yu. First-rate at leading troops, decent at solo combat. While accompanying Cao Pi against Wu, shot in the waist and killed by Wu officer Ding Feng.\nZhang He: Of the Five Elite Generals. Framed by Guo Tu at Guandu; defected to Cao Cao. Defeated by Zhang Fei (at Zhang Fei\u0026rsquo;s tomb in Langzhong you can still see Zhang Fei\u0026rsquo;s inscription \u0026ldquo;Great Victory over Zhang He\u0026rsquo;s Forces\u0026rdquo;). Seems only able to trade a few dozen bouts with Zhang Fei — solo combat: average; commanding troops: first-rate. More appearances in the early period. Later, pursuing too deep, killed by Kongming\u0026rsquo;s massed crossbows at Jianmen Pass.\nXu Huang: Of the Five Elite Generals. Appears so often it\u0026rsquo;s impossible to recount everything. During Li Jue and Guo Si\u0026rsquo;s rebellion, served under Yang Feng, later defected to Cao Cao. Fought Xu Chu for fifty bouts — solo combat: decent. Also close friends with Guan Yu (during the Yan Liang-Wen Chou incident, Zhang Liao and Xu Huang fought poorly; Guan Yu stepped up and cut each down in one stroke — presumably this is when they became friends\u0026hellip;). With Cao Ren, jointly defeated Guan Yu\u0026rsquo;s Jing Province army. Later, when Meng Da rebelled again, was shot in the forehead and died.\nYue Jin: Of the Five Elite Generals. Also appears very frequently, often leading troops. Fought Lü Bu\u0026rsquo;s officer Zang Ba for thirty bouts; fought Ling Tong for fifty bouts — solo combat: average. 
During the Hefei campaign against Sun Quan, while dueling Ling Tong, Cao Xiu shot Ling Tong off his horse; Gan Ning then shot Yue Jin in the face with a single arrow. Never appears again — unclear if he recovered.\nYu Jin: Of the Five Elite Generals. During Zhang Xiu\u0026rsquo;s rebellion, when people accused Yu Jin of defecting, he didn\u0026rsquo;t first clear his name but instead set up camp to resist the enemy — praised by Cao Cao. When Fancheng was besieged, he led reinforcements. Afraid Pang De would steal glory, he engaged in various petty maneuvers. Badly positioned his troops; Guan Yu flooded them and captured him. Yu Jin surrendered. After Lü Meng took Jing Province, he released the imprisoned Yu Jin back to Wei. Later scorned by Cao Pi; died in despondency.\nPang De: Previously under Ma Chao. Extraordinarily brave — a personal favorite. Carried his own coffin into battle. Could fight Guan Yu for 100 bouts — the highest honor in the solo-combat world. His reputation doesn\u0026rsquo;t match the Five Tiger Generals, Xu Chu, or Dian Wei, but I personally believe his solo combat ability is on the same level. Unfortunately, never truly utilized. Then dragged down by his deadweight teammate Yu Jin: Guan Yu flooded seven armies and captured him. Refused to submit to Guan Yu, refused to surrender, was executed. A true hero.\nLi Dian: Frequently appears leading troops in the early-mid period. Captured Huang Shao alive. Solo combat: exchanged about ten bouts with Zhao Yun, realized he was outmatched, turned his horse and retreated. Never seen again after the Hefei campaign.\nLady Zhen: Yuan Xi\u0026rsquo;s wife. After Yuan Shao\u0026rsquo;s defeat, Cao Pi snatched her and made her empress.\nSima Yi: Late-period god-tier grand strategist. Can read stars. Can even grab a blade and solo. Fought Zhuge Liang to a standstill around Hanzhong. Never lost to Liang at the grand strategic level. 
Later seized Cao Shuang\u0026rsquo;s military power; the Sima clan took control of Wei.\nSima Shi: Sima Yi\u0026rsquo;s eldest son. His characterization in the late period is relatively well done. \u0026ldquo;Round face, large ears, square mouth, thick lips. Under his left eye grew a black mole, from which sprouted dozens of black hairs.\u0026rdquo; While battling Wen Yang, \u0026ldquo;his eyeball burst out from the mole\u0026rsquo;s wound, blood streaming across the ground. In unbearable agony, yet fearing it would unsettle the troops, he merely bit his quilt and endured — biting the quilt to shreds.\u0026rdquo; Then bedridden. Shortly after, \u0026ldquo;with a great cry, his eye burst forth, and he died.\u0026rdquo;\nSima Zhao: Sima Yi\u0026rsquo;s second son. Prince of Jin.\nSima Yan: Sima Zhao\u0026rsquo;s son. Emperor of Jin.\nDeng Ai: Late-period undefeated war god. Fought Jiang Wei to a standstill, never lost at the grand strategic level. Rolled down a cliff wrapped in felt, launched a surprise raid into Shu — the Shu people thought divine soldiers had descended from heaven and opened their gates in surrender. The man who conquered Shu.\nShu Side # Liu Bei: Everyone says Liu Bei\u0026rsquo;s benevolence was fake — personally, I think that\u0026rsquo;s an embellishment. From the novel\u0026rsquo;s portrayal of Xuande, Bei was genuinely benevolent. If he\u0026rsquo;d just taken Liu Biao\u0026rsquo;s resources in Jing Province directly, none of that mess would\u0026rsquo;ve happened. When entering Shu, the outcome was indeed duplicitous, but the novel still portrays Xuande with benevolence — I choose to respect the original here.\nGuan Yu: A wildly popular character. Personally not a fan (in real life, this kind of person is extremely annoying). Early period: a god. Cut down foes in a single stroke. Single-handedly drove back Xu Huang + Xu Chu (probably only Lü Bu could match that feat). Arrogant and rude. 
Bears the lion\u0026rsquo;s share of blame for the loss of Jing Province. The loss of Jing is the novel\u0026rsquo;s plot turning point — Cao Cao, Liu Bei, Zhang Fei, Huang Zhong all die in rapid succession; remaining generals and strategists all fade from the storyline. Also, this man has an arrow-magnet constitution: shot during Crossing Five Passes, shot by an \u0026ldquo;air arrow\u0026rdquo; at Changsha fighting Huang Zhong, shot fighting Pang De, shot with a poisoned arrow attacking Fancheng.\nZhang Fei: Zhang Fei is a highly stylized character, but his combat record is better than Guan Yu\u0026rsquo;s. \u0026ldquo;Round-eyed rogue,\u0026rdquo; brave and cunning, hates evil like an enemy, true to his nature. The only person in the entire Three Kingdoms who dares to taunt Lü Bu. Can fight Lü Bu for 100 bouts. Drove back Cao Cao\u0026rsquo;s army with a single roar at Changban Slope. Marched into Shu by land. Honorably released Yan Yan. Shattered Zhang He. Stabbed and wounded Xu Chu. Can lead troops, can solo, has tactical intelligence — a top-tier Three Kingdoms general. One scene is especially moving: after Guan Yu\u0026rsquo;s death, Liu Bei kept delaying the revenge campaign. Zhang Fei said to Liu Bei: \u0026ldquo;Our brother is dead — what\u0026rsquo;s the point of being emperor?\u0026rdquo; \u0026ldquo;If you won\u0026rsquo;t avenge our brother, don\u0026rsquo;t bother seeing me again.\u0026rdquo; 👍. I previously visited Zhang Fei\u0026rsquo;s tomb in Langzhong — his calligraphy was remarkably refined, nothing like the crude brute you\u0026rsquo;d imagine\u0026hellip;\nZhuge Liang: A monster.\nPang Tong: \u0026ldquo;Sleeping Dragon and Young Phoenix — obtain one and you can have the realm\u0026rdquo; is pure bluster. Cannot be ranked alongside Zhuge Liang. Combat record is basically negative.\nXu Shu: God-tier grand strategist. Under Liu Bei (attached to Liu Biao), engineered the first-ever defeat of Cao Cao\u0026rsquo;s army (Cao Ren). 
Defining trait: filial piety\u0026hellip; Cheng Yu forged a letter from his mother to summon Yuanzhi. Xu Shu went to Cao Cao\u0026rsquo;s camp; after his mother committed suicide, Xu Shu, out of pride, still wouldn\u0026rsquo;t return to Liu Bei\u0026rsquo;s side\u0026hellip; utterly baffling.\nFa Zheng: The strategist for Huang Zhong\u0026rsquo;s army when Xiahou Yuan was killed. Other schemes had no weaknesses. Died early. One of only two people Zhuge Liang ever sought advice from.\nMa Su: The other person Zhuge Liang ever sought advice from. During the Southern Barbarian campaign, he was the first to propose a strategy aimed at winning hearts rather than annihilation. As long as he didn\u0026rsquo;t lead troops himself, god-tier. First time leading troops: defeated by Sima Yi. Later executed by the Chancellor. Also: the Chancellor shedding tears as he executed Ma Su — he wasn\u0026rsquo;t crying for Ma Su, but lamenting that the late Emperor\u0026rsquo;s legacy of the northern expedition remained unfulfilled.\nZhao Yun: Never lost a solo fight. No one could go 100 bouts with him. Basically, he\u0026rsquo;d show up and \u0026ldquo;spear them dead in one thrust.\u0026rdquo; Evasion maxed out — \u0026ldquo;the hero of Changban Slope is still in his prime.\u0026rdquo; Rarely led troops; more like Liu Bei\u0026rsquo;s personal guard, protecting the imperial family. Liu Bei called him brother, but he never entered the core trio.\nHuang Zhong: Fought Guan Yu for 100 bouts at Changsha. His horse stumbled and Guan Yu spared him. Later shot an arrow without the arrowhead attached, repaying the debt. Killed Xiahou Yuan in the Hanzhong campaign.\nMa Chao: \u0026ldquo;Splendid Ma Chao.\u0026rdquo; Cao Cao: \u0026ldquo;Ma Chao\u0026rsquo;s valor is no less than Lü Bu\u0026rsquo;s in his prime.\u0026rdquo; Nearly made Cao Cao cut off his beard and discard his robe in flight — Cao was saved by Cao Hong. Fought top-tier warriors Xu Chu and Zhang Fei for 200 bouts each. 
Rash and cruel; committed city massacres. Limited achievements under Liu Bei.\nWei Yan: Had \u0026ldquo;a rebellious bone at the back of his skull.\u0026rdquo; An important Shu general in the mid-late period. Solo combat: decent. Leading troops: first-rate. Zhuge Liang predicted that after his death, Wei Yan would rebel — killed by Ma Dai. The famous Ziwu Valley gambit: though Sima Yi praised it, I personally think it\u0026rsquo;s a bit far-fetched.\nYan Yan: Solo combat ability basically zero. Pummeled by Zhang Fei. Archery ability: top-tier — shot Zhang Fei\u0026rsquo;s helmet. Participated in the Hanzhong campaign. No further appearances.\nHuang Yueying: Actually, not much of a role. Only described when introducing Zhuge Liang\u0026rsquo;s son Zhuge Zhan: \u0026ldquo;The mother was exceedingly ugly but possessed extraordinary talents: versed in astronomy above, geography below; there was no book of strategy, divination, or escape arts she had not mastered.\u0026rdquo;\nZhuge Zhan: Son of the Marquis of Wu (Zhuge Liang). Hyped up upon debut, then sent against Deng Ai — killed by Deng Ai.\nLiu Feng: Liu Bei\u0026rsquo;s adopted son. Solo combat ability: low. No tactical sense. Easily persuaded. A net negative. Guan Yu disliked him. Later, when Guan Yu was defeated and sought reinforcements, Liu Feng and Meng Da refused to send troops, contributing to Guan Yu\u0026rsquo;s death. Subsequently executed by Liu Bei.\nMeng Da: Betrayed, then betrayed again. Shares blame for Guan Yu\u0026rsquo;s death. Only highlight: shot and killed Xu Huang. Later killed by Sima Yi.\nLiu Bei\u0026rsquo;s Wives: Lady Gan — the one who threw herself into the well at Changban Slope, A-Dou\u0026rsquo;s birth mother. Lady Mi — died while Liu Bei was in Jing Province, which led to Wu\u0026rsquo;s marriage proposal. Sun Shangxiang — a fifty-something old ox marrying a sixteen-year-old girl\u0026hellip;\nMi Zhu: Brother of Liu Bei\u0026rsquo;s wife Lady Mi. 
A tool character — basically Liu Bei\u0026rsquo;s envoy for delivering messages.\nMi Fang: Brother of Liu Bei\u0026rsquo;s wife Lady Mi. Technically the Emperor\u0026rsquo;s brother-in-law, yet surrendered to Wu. Bears responsibility for Guan Yu\u0026rsquo;s death.\nSun Qian, Jian Yong: Followed Liu Bei in the early period. No particular talent. Tool characters.\nGuan Ping: Guan Yu\u0026rsquo;s adopted son. Fought Pang De for thirty bouts. Later captured alongside Guan Yu by Wu; executed.\nGuan Xing: Guan Yu\u0026rsquo;s biological son. A key general in the mid-late period. Killed Pan Zhang — his father\u0026rsquo;s murderer — and recovered the Green Dragon Blade. A main combat general on the Qishan campaigns. Later died of illness.\nZhang Bao: Zhang Fei\u0026rsquo;s biological son. A key general in the mid-late period. Appears alongside Guan Xing.\nLiao Hua: Originally a Yellow Turban, later followed Guan Yu. During the desperate escape from Mai Castle, ran out to seek reinforcements and survived. Later appears on the Qishan campaigns.\nZhou Cang: Originally under Zhang Bao, later followed Guan Yu — carried Guan Yu\u0026rsquo;s blade. Fought Zhao Yun and lost repeatedly, taking three spear wounds. Solo combat: weak. Committed suicide after Guan Yu\u0026rsquo;s death.\nMa Dai: A late-period Shu general. Frequent appearances. Achievements in the Southern Barbarian campaign and Qishan expeditions. Under the Chancellor\u0026rsquo;s brocade-bag stratagem, executed Wei Yan.\nJiang Wei: A Wei defector. Inherited Zhuge Liang\u0026rsquo;s will. Accomplished in both letters and arms. Launched (I think ten) expeditions from Qishan\u0026hellip; Later, Deng Ai raided Shu; Liu Shan surrendered. Jiang Wei was still holding Jianmen Pass\u0026hellip; \u0026ldquo;We fight to the death — why do you surrender first!\u0026rdquo;\nWu Side # Sun Jian: Among the useless warlord coalition, capable of fighting. Obsessed with the Imperial Seal. 
Swore he didn\u0026rsquo;t have the Seal — \u0026ldquo;may I be shot dead by random arrows.\u0026rdquo; Later shot dead by Huang Zu\u0026rsquo;s troops.\nSun Ce: Sun Ce the Little Conqueror. A fierce warrior. Essentially conquered all of Jiangdong single-handedly — just died too young. A personal favorite. Trading the Imperial Seal for troops to build his kingdom — a stroke of genius, surpassing his father. Killed by an enemy\u0026rsquo;s revenge attack. While recovering, because he refused to believe in superstition, was mystically killed by Yu Ji.\nYu Ji: The people thought he was an immortal. Could cure people with talisman water. Executed by Sun Ce. His ghost haunted Sun Ce and killed him\u0026hellip;\nSun Quan: Zero military talent whatsoever. Pummeled at Hefei. His strong point: recognizing talent. All four of Wu\u0026rsquo;s early (and most important) Grand Commanders were strong.\nTaishi Ci: Appeared quite early — already present when Liu Bei was helping Tao Qian. Later joined Liu Yao, then was subdued by Sun Ce. Later, done in by Sun Quan at Hefei\u0026hellip;\nGan Ning: \u0026ldquo;Brocade Sail Pirate.\u0026rdquo; Wu\u0026rsquo;s number one combat power. Expert archer. (Honestly, Wu\u0026rsquo;s generals\u0026rsquo; combat ability is not impressive.)\nLing Tong: His father Ling Cao was killed by Gan Ning — a blood feud. During the Hefei campaign, saved by Gan Ning; they reconciled.\nHuang Gai: Master of getting beaten. Actually, no combat highlights on the battlefield. At Red Cliffs, shot off his boat by Zhang Liao with one arrow, rescued by Zhou Yu. No further news.\nZhou Yu: Wu\u0026rsquo;s first Grand Commander. A winner in life. A personal favorite. Too bad: \u0026ldquo;Since Heaven gave birth to Yu, why did it also give birth to Liang?\u0026rdquo;\nZhang Hong: No role.\nZhang Zhao: Leader of the dove faction. Default answer to everything: surrender. Virtually none of his schemes ever worked.\nLu Su: Wu\u0026rsquo;s second Grand Commander. Timid. 
Appreciated by Zhou Yu. No combat achievements. Many say he was \u0026ldquo;shrewd beneath a foolish exterior\u0026rdquo; — the original does have hints of this, but \u0026ldquo;shrewd beneath foolish\u0026rdquo; is a stretch. Basically just a messenger between Liang and Yu. The originator of the empty-handed Jing Province recovery attempts.\nLü Meng: Wu\u0026rsquo;s third Grand Commander. Mastermind of \u0026ldquo;Crossing the River in White\u0026rdquo; (disguising troops as merchants). Can basically be considered the killer of Guan Yu. When he saw the beacon towers in Jing Province and couldn\u0026rsquo;t find a way to break through, he claimed illness and stayed home (absolutely hilarious). Later seen through by Lu Xun. After Guan Yu\u0026rsquo;s death, mystically killed by Guan Yu\u0026rsquo;s ghost.\nLu Xun: Wu\u0026rsquo;s fourth Grand Commander. Mastermind of Yiling. Later participated in several major campaigns. Lu Xun\u0026rsquo;s talent was not beneath Gongjin\u0026rsquo;s (Zhou Yu\u0026rsquo;s).\nZhou Tai: Fought his way in and out to rescue Sun Quan. For every wound he bore, Sun Quan made him drink a cup of wine.\nPan Zhang: Fought Guan Yu — lasted only three bouts before fleeing. Guan Yu\u0026rsquo;s spirit manifested; killed by Guan Xing.\nDing Feng: Shot and killed Zhang Liao. An important late-period Wu general. Survived until chapter 119.\nMa Zhong: Pan Zhang\u0026rsquo;s subordinate. Many have never heard of this character, but he killed both Guan Yu and Huang Zhong. Killing Guan Yu was cleaning up; killing Huang Zhong was genuine skill — one arrow took down the master archer Huang Zhong. Later assassinated by Mi Fang.\nJiang Qin, Han Dang, Xu Sheng\u0026hellip;: Too many, unremarkable, can\u0026rsquo;t remember.\nSun Shangxiang: No such character exists in the official histories. In the novel, Sun Shangxiang has only personality description — she likes dancing with blades and swords. She never actually participated in combat. 
No children after marrying Liu Bei. Later tricked into returning to Wu; never saw Liu Bei again. But later generations adore Sun Shangxiang — she\u0026rsquo;s a fan-favorite character. Total War\u0026rsquo;s beauty icon: Others # He Jin: General-in-Chief, Empress He\u0026rsquo;s brother. An utter fool. Held all the cards and played them terribly. To deal with the Ten Regular Attendants, he summoned Dong Zhuo to the capital, setting off an unstoppable chain reaction — the realm fell into chaos.\nZhang Jiao, Zhang Bao, Zhang Liang: Yellow Turban rebel leaders. Could cure with talisman water, summoned divine soldiers. The rest of the time, basically got beaten up by the regular army. A religious peasant uprising, hastily concluded. The novel dismisses it with \u0026ldquo;Zhang Jiao was already dead.\u0026rdquo;\nYuan Shao: Previously under He Jin — back then he was quite strategic, even dared to confront Dong Zhuo directly. \u0026ldquo;Many schemes but poor decisions.\u0026rdquo; His advisers each pushed their own agenda; none of his generals were worth anything.\nYuan Tan, Yuan Xi, Yuan Shang: Yuan Shao\u0026rsquo;s three sons. Still fighting over power after Yuan Shao\u0026rsquo;s defeat.\nLü Bu: The number one warrior of the Three Kingdoms. Early period: unstoppable in single combat. Only lost when ganged up on. (Late period Lü Bu once soloed Zhang Fei.) Cao Cao had suffered at Lü Bu\u0026rsquo;s hands — ultimately Cao Cao was the big winner.\nChen Gong: After Cao Cao\u0026rsquo;s failed assassination of Dong Zhuo, Chen Gong followed him — and witnessed Cao Cao\u0026rsquo;s treachery: \u0026ldquo;Better that I betray the world than let the world betray me.\u0026rdquo; Disgusted with Cao Cao, he left and later joined Lü Bu.\nZhang Song: Liu Zhang\u0026rsquo;s subordinate. Arrogant. Cao Cao disliked him. Later defected to Liu Bei, offered the map of Western Sichuan. 
Later discovered by Liu Zhang colluding with Liu Bei; executed.\nZhang Xiu: Featured prominently in early battles against Cao Cao. Originally surrendered to Cao Cao, but because his aunt was forcibly taken by Cao Cao, he rebelled. Cao Ang and Dian Wei died in this battle. Later defeated by Cao Cao again; surrendered.\nChunyu Qiong: Commander of the Wuchao supply depot. Drinking ruined everything.\nLi Ru: Dong Zhuo\u0026rsquo;s strategist. Never appears again after Dong Zhuo\u0026rsquo;s death.\nZuo Ci: A full chapter of pure mysticism. Stunned everyone reading it — \u0026ldquo;Come out and see the immortal~\u0026rdquo;\nYu Ji (the physician): Cao Cao\u0026rsquo;s doctor. Tried to poison Cao Cao; discovered.\nHua Tuo: The Three Kingdoms\u0026rsquo; number one physician. Skilled in surgery. Treated Zhou Tai\u0026rsquo;s wounds. Scraped Guan Yu\u0026rsquo;s bones to cure poison. Later, while treating Cao Cao\u0026rsquo;s head ailment, suspected of being a second Yu Ji — died in prison.\nChen Lin: One of the Seven Masters of the Jian\u0026rsquo;an period. His \u0026ldquo;Proclamation Against the Usurper\u0026rdquo; is recommended reading in full.\nFinal Thoughts # There\u0026rsquo;s simply too much to discuss. I thought I could finish this piece in two or three hours — ended up costing several times that. Romance of the Three Kingdoms is truly brilliant, absolutely worth reading (as if that needed saying). I probably won\u0026rsquo;t continue with the Records — time to urgently start the next chapter\u0026hellip;\n","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/2024/08/12/book-notes-romance-of-the-three-kingdoms/","section":"Posts","summary":"Preface # Mention Romance of the Three Kingdoms, and it seems almost everyone can name a few characters or plot points. 
But have you actually read the original?\nI’ve always been a fan of Three Kingdoms-themed games — titles like Bàwáng Dàlù (The Overlord’s Continent) and Total War: Three Kingdoms are among my favorites. I love the feeling of collecting famous generals and rampaging across the battlefield. But thinking back, I realized I’d never actually read Romance of the Three Kingdoms in its entirety. Some of those generic officers in Total War — I had no idea who they were. And when I thought about it, I couldn’t come up with a single novel that could stand toe-to-toe with Romance, so I decided to give the original a try. Once I started, I couldn’t stop…\n","title":"Book Notes — Romance of the Three Kingdoms","type":"posts"},{"content":" Preface # An unavoidable work for any sci-fi fan: Arthur C. Clarke\u0026rsquo;s classic — the Space Odyssey series. The Space Odyssey consists of four volumes: 2001: A Space Odyssey, 2010: Odyssey Two, 2061: Odyssey Three, and 3001: The Final Odyssey. As the titles suggest, the futuristic technological visions take place in their respective years. Don\u0026rsquo;t think 2001 has already passed — when Clarke wrote 2001, it was 1968! I, at least, can\u0026rsquo;t imagine what the world will look like thirty years from now, or how far humanity will have advanced in space exploration.\nI wrote a reading reflection after first finishing 2001, captivated by its premise, its thrilling space plotlines, its fantastical cosmic backdrop\u0026hellip; I immediately dove into the remaining three volumes. I initially expected the setting to expand ever outward, but that\u0026rsquo;s not what happened. The later three books remain within this cosmic dimension — between Jupiter and Earth — which is already very, very small. They mostly fill in plot details and imagination, bringing the entire story to completion.\nThe Tetralogy # 2001: A Space Odyssey # After finishing the entire series, this one still feels like the most classic. 
Maybe it\u0026rsquo;s because the plot was the result of discussions with Kubrick\u0026hellip;\nSince I\u0026rsquo;ve already written a full reflection on it before, I won\u0026rsquo;t belabor it here. Interested friends can check out my earlier Book Notes — 2001: A Space Odyssey.\n2010: Odyssey Two # This volume is also brilliant. In the original, China\u0026rsquo;s spacecraft sent to explore Jupiter is named the Qian Xuesen, and the Qian Xuesen is the first manned mission to land on and explore Jupiter\u0026rsquo;s moon — Europa — beating the Americans to it! Though the outcome wasn\u0026rsquo;t great, the plot is thrilling~ Before the Qian Xuesen\u0026rsquo;s accident, the astronauts described lower life forms on Europa and were ultimately attacked and killed by \u0026ldquo;extraterrestrial organisms.\u0026rdquo; This \u0026ldquo;disaster\u0026rdquo; plotline sparks infinite imagination: what kind of life exists on Europa? And what should we humans do about it?\nFinally, the monolith on Jupiter goes through a series of self-replications and ultimately transforms Jupiter into a small star! Jupiter is ignited! From then on, there are two \u0026ldquo;suns\u0026rdquo; in the sky. This premise is just fantastic~\n2061: Odyssey Three # This one feels a bit rushed, mainly because Halley\u0026rsquo;s Comet was coming. Clarke wrote in the preface that since Halley\u0026rsquo;s Comet was about to sweep past Earth, if he didn\u0026rsquo;t release the book soon, the exploration-of-Halley plotline might become untimely. Indeed, a large portion of this volume is devoted to exploring Halley\u0026rsquo;s Comet. There are some Jupiter-related plotlines, but they don\u0026rsquo;t advance the main narrative much.\nHalley\u0026rsquo;s Comet orbits the sun once every 76 years. Its next return is about 40 years away (July 28, 2061). Thinking back, its last perihelion was roughly when this book was written — the whole world was talking about Halley\u0026rsquo;s Comet. 
(I can feel that no one\u0026rsquo;s mentioned it in recent years.)\n3001: The Final Odyssey # A perfect concluding work! This conclusion has influenced countless sci-fi novels — you can even clearly sense the shadow of The Three-Body Problem.\nThe first three volumes are all still in the 21st century. 3001 jumps a full thousand years! Humanity now acquires knowledge through \u0026ldquo;brain-computer interfaces\u0026rdquo; rather than learning; the speed of space travel has increased enormously\u0026hellip;\nBut honestly, a thousand years — a thousand years and humanity has only progressed this far? I\u0026rsquo;d rather believe it was because of the sophons. Huh? Could it be that Old Liu\u0026rsquo;s sophons were inspired by this exact idea?\nThe most brilliant part of this volume is humanity resurrecting Poole — an astronaut killed by HAL in the first book. If no one brought him up, you\u0026rsquo;d assume he was still drifting in space\u0026hellip; Resurrecting Poole not only echoes the first book\u0026rsquo;s plot but also allows us to observe and unveil the human world of the year 3001 through the eyes of an \u0026ldquo;ancient person.\u0026rdquo;\nThe Shadow of The Three-Body Problem # Or to put it the other way around: the shadow of Space Odyssey in The Three-Body Problem. I tried recalling from memory — apologies for any omissions:\nThe Sower — the Singer. In Space Odyssey, the Overlords are \u0026ldquo;planting\u0026rdquo; life; in The Three-Body Problem, the Overlords casually \u0026ldquo;eliminate\u0026rdquo; life — \u0026ldquo;What does it have to do with you?\u0026rdquo;\nAlien warning. \u0026ldquo;Stay away from Europa\u0026rdquo; — \u0026ldquo;Do not answer! Do not answer! Do not answer!\u0026rdquo;\nDisplay of alien technology. The Monolith — the Droplet. Both are materials beyond human comprehension, impossibly smooth, artifacts of alien civilizations that humanity\u0026rsquo;s technology cannot fathom. 
They represent the vast gap between human and alien technological levels.\nAlien life is coming. We have time to catch our breath, but it seems like nothing we do will matter.\nResistance plans. With alien beings about to arrive, people begin formulating resistance plans. At this point, humanity still doesn\u0026rsquo;t know what the enemy truly looks like.\nIn fact, the two works differ greatly in many ways. Space Odyssey is about cosmic exploration, while The Three-Body Problem is about human society as a whole facing alien civilization. Space Odyssey essentially has only a handful of protagonists, even across a thousand years, and the plot mainly revolves around Jupiter. The Three-Body Problem has a grander scale and far more characters\u0026hellip;\nFinal Thoughts # I\u0026rsquo;ve finally finished the Space Odyssey tetralogy. You can truly feel it\u0026rsquo;s a monumental work of science fiction — it satisfies a sci-fi fan\u0026rsquo;s longing for the \u0026ldquo;exploration\u0026rdquo; of space. Before any space agency had begun exploring \u0026ldquo;there,\u0026rdquo; Arthur C. Clarke had already arrived. NASA astronauts would even write back to Clarke: \u0026ldquo;We photographed the far side of the moon. There were no monoliths, no anomalies\u0026rdquo; — almost as if saying, \u0026ldquo;You fraud, I went there precisely because I read your book!\u0026rdquo; Haha~\nClarke wrote many plotlines about the Qian Xuesen spacecraft in the novels, and in the afterwords of several volumes, he repeatedly emphasized that Qian Xuesen was a person who profoundly influenced the aerospace industry — both in China and the United States. The U.S. arrested him on fabricated charges, and Qian Xuesen ultimately returned to his homeland to build its aerospace program from scratch, influencing missile development. 
During a trip to Beijing, Clarke even made a special attempt to visit Qian Xuesen, but at the time, Qian\u0026rsquo;s health was poor, and his doctors wouldn\u0026rsquo;t permit visitors. Clarke entrusted someone to deliver an autographed copy of Space Odyssey to Qian.\nReading the entire series, you can feel the era\u0026rsquo;s obsession with space exploration. But after the Apollo program was shut down, people seemed to lose interest in space altogether. However, with Musk\u0026rsquo;s Mars colonization plans, the theme of \u0026ldquo;space\u0026rdquo; seems to be returning to public consciousness. NASA says they\u0026rsquo;ll land on Mars by 2040 — who knows if it\u0026rsquo;s true. I\u0026rsquo;ll come back to dig up this post then.\n","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/2024/08/12/book-notes-space-odyssey-series/","section":"Posts","summary":" Preface # An unavoidable work for any sci-fi fan: Arthur C. Clarke’s classic — the Space Odyssey series. The Space Odyssey consists of four volumes: 2001: A Space Odyssey, 2010: Odyssey Two, 2061: Odyssey Three, and 3001: The Final Odyssey. As the titles suggest, the futuristic technological visions take place in their respective years. Don’t think 2001 has already passed — when Clarke wrote 2001, it was 1968! I, at least, can’t imagine what the world will look like thirty years from now, or how far humanity will have advanced in space exploration.\n","title":"Book Notes — Space Odyssey Series","type":"posts"},{"content":" Preface # Many people probably know To Kill a Mockingbird. I saw its ratings were sky-high and couldn\u0026rsquo;t resist picking it up. Sure enough, the story is brilliant — never a dull moment. Its style is quite different from the books I\u0026rsquo;d read before. Personally, I think it\u0026rsquo;s perfectly suited for middle school readers (no condescension intended) — a simple, fun, and superbly written story. 
In truth, the message the whole book wants to convey is very clear: don\u0026rsquo;t harm innocent people. The real difficulty lies in how to build a brilliant story around such a simple idea.\nProse Style # I think the best thing about Mockingbird is its prose. The author brings a small-town story in the American South vividly to life. Several storylines are stunningly rendered, the plot follows the timeline smoothly without feeling muddled, and it reads effortlessly and comfortably. The story plants foreshadowing from the very beginning, only unearthing the biggest reveal right at the end. The depiction of the gap between white and Black lives is also extraordinarily vivid. This book\u0026rsquo;s setting is contemporaneous with the HBO series Boardwalk Empire, which I\u0026rsquo;d watched before — that show also features Black neighborhoods, so I could easily picture the white and Black communities.\nOne scene where the protagonist gets beaten up left a deep impression: \u0026ldquo;I was pressed to the ground, and before my eyes was a tiny ant, laboriously hauling a breadcrumb through the grass.\u0026rdquo; I find it hard to articulate exactly what this passage means. She\u0026rsquo;s being assaulted, yet her attention is caught by an ant carrying a breadcrumb? Maybe it means nothing? But whatever the case, this description makes almost everyone mentally highlight it — it\u0026rsquo;s so visually evocative. And it feels very much like stepping out for a cigarette after being immersed in stressful work for too long\u0026hellip; it yanks you from tense, urgent action into another quiet world, then immediately back again.\nAnother Prime Minister story also left a deep impression. 
The young protagonist asks her father: \u0026ldquo;What\u0026rsquo;s a \u0026lsquo;whore\u0026rsquo;?\u0026rdquo; Her father tells her a story about a Prime Minister blowing a feather: \u0026ldquo;Every day the Prime Minister sits in the House of Commons blowing a feather toward the ceiling, straining every sinew to keep it from drifting down — yet people around him keep losing their heads one after another.\u0026rdquo; Reading this, I was just as baffled as the protagonist. What on earth does any of this have to do with anything? I only figured it out after consulting Baidu. Her father meant: don\u0026rsquo;t obsess over irrelevant things. Which is to say, her father offered no explanation at all. But by the time of the rape trial, the young protagonist understood perfectly — she knew what \u0026ldquo;rape\u0026rdquo; meant. It\u0026rsquo;s hard to say whether this kind of evasive education is right or wrong.\nWho Killed Bob Ewell? # The final chapters are brilliantly rendered. While reading, I felt completely immersed in that pitch-black schoolyard night, that son-of-a-bitch Ewell (that\u0026rsquo;s how the sheriff refers to Ewell in the book — the first time I read that line, I silently cursed him too\u0026hellip;) hunting down two innocent children\u0026hellip; In the end, Ewell dies, but the full truth of what happened isn\u0026rsquo;t entirely clear. The narrative is told from the young protagonist\u0026rsquo;s first-person perspective, but she doesn\u0026rsquo;t see who killed Ewell.\nSince I read the e-book version, I could see many readers\u0026rsquo; annotations and comments. I found that many people completely missed the key details of the case. I was also utterly confused after my first read-through. I reread the relevant sections several times and finally pieced together the author\u0026rsquo;s intent and the full sequence of events. 
Let me unravel this mystery through several key questions.\n1) Is Boo Radley Black or white?\nThis question seems absurd but is critically important. If Radley were Black, then Tom Robinson\u0026rsquo;s case would be a cautionary tale: for a Black man, killing a white man would be enough to get him executed several times over. The jury wouldn\u0026rsquo;t care about the truth; the defendant being Black would be sufficient for a guilty verdict. So the old father and old sheriff\u0026rsquo;s desire to protect Radley would be perfectly natural — putting Radley through the legal process would just be throwing away a good man\u0026rsquo;s life. This would also make the novel a work primarily about racism against Black people.\nBut Radley is white. So none of the above applies. This also brings the novel\u0026rsquo;s content more in line with its title. The author never directly states that Radley is white, but you can put it this way: if the author doesn\u0026rsquo;t specify someone is Black, then they\u0026rsquo;re white~. Of course, there are other clues: Radley ran around with Cunningham boys (white) as a kid; he lives in a white neighborhood; his skin is deathly pale\u0026hellip; Radley is a character described from the very beginning to the very end, the most richly drawn \u0026ldquo;mockingbird\u0026rdquo; of the entire book — yet we only see his true face in the final two chapters. That\u0026rsquo;s why I felt so unsettled not knowing whether he was white\u0026hellip;\n2) The gap in the action\nHere is the passage where the young protagonist is pinned down by Ewell and ultimately saved:\n\u0026ldquo;He was slowly choking me, and I couldn\u0026rsquo;t move at all. Suddenly, he was yanked hard from behind and fell to the ground with a thud, nearly dragging me down with him. I thought, Jem must have gotten up.\nSometimes, human reactions are sluggish. I stood there dumbly, like a mute. The sounds of struggle slowly subsided. Someone was panting heavily. 
The night returned to its prior stillness.\n\u0026hellip;I slowly realized there were four people under the tree now.\u0026rdquo;\nFrom the moment Ewell is pulled away to the moment there are four people under the tree, a struggle took place. Afterward:\nJem (the protagonist\u0026rsquo;s older brother) lies on the ground, injured by Ewell, unconscious\nEwell (the man who tried to kill children) lies dead with a kitchen knife in his ribs\nRadley (the man who came to save the children) leans against a tree, coughing\nThe protagonist stands frozen, still in shock\nThe \u0026ldquo;gap\u0026rdquo; refers to: who pulled Ewell away and killed him? What exactly happened? The subsequent discussion between Atticus and the sheriff revolves around reconstructing this gap.\n3) The kitchen knife\nFirst, the knife the sheriff uses for his demonstration is a switchblade, not the kitchen knife. \u0026ldquo;Was Ewell killed with this knife?\u0026rdquo; \u0026ldquo;No, that knife is still in him. From the handle, it\u0026rsquo;s a kitchen knife.\u0026rdquo; So the sheriff did not destroy the murder weapon — that\u0026rsquo;s a fact.\nIn any homicide, the murder weapon is an extraordinarily critical piece of evidence. Clearly, this kitchen knife is the murder weapon. Whoever brought this kitchen knife is very likely the killer. The sheriff says, \u0026ldquo;Ewell probably found that kitchen knife somewhere in the dump\u0026hellip; sharpened it razor-sharp\u0026hellip; Ewell fell on his own knife.\u0026rdquo; This is the sheriff\u0026rsquo;s subjective speculation. There are many possibilities:\nEwell brought the knife, tripped himself, and the kitchen knife stabbed into his ribs — an accidental death.\nEwell brought the knife; Jem, despite his broken arm, wrestled it away and killed him.\nRadley brought the knife and killed Ewell.\nFirst: the probability of Ewell bringing the knife is low. 
If he\u0026rsquo;d brought a knife, he could have just rushed up and stabbed them — there\u0026rsquo;d be no need to go to the trouble of twisting Jem\u0026rsquo;s arm and strangling the protagonist. Now:\nScenario 1: Ewell was already fighting someone (though it\u0026rsquo;s never explicitly stated with whom). An accidental death at this point seems far-fetched, but it can\u0026rsquo;t be entirely ruled out — though the probability is extremely low. Scenario 2: Broken-arm Jem wrestles the knife away and kills Ewell. This scenario is based on the protagonist saying \u0026ldquo;it felt like Jem pulled Ewell back\u0026rdquo; — so naturally it must have been Jem fighting Ewell. The protagonist didn\u0026rsquo;t see who pulled Ewell back; she only says it \u0026ldquo;felt like\u0026rdquo; him. A thirteen-year-old boy with a freshly broken arm taking a knife from an adult and killing him — also an extremely low probability. Scenario 3: Radley brought the knife, yanked Ewell back to stop him from strangling the protagonist, then stabbed Ewell to death. This is the most likely scenario — and precisely the scenario that Atticus and the sheriff \u0026ldquo;deliberately\u0026rdquo; avoid mentioning during their reconstruction. One detail supports this: before Ewell burst out, both children had screamed. The neighbors probably didn\u0026rsquo;t hear — but earlier in the book, it\u0026rsquo;s mentioned that the tree at the scene is very close to Radley\u0026rsquo;s house. 4) The reconstruction\nWhether viewed through the novel\u0026rsquo;s themes and atmosphere, or through specific case analysis, it\u0026rsquo;s almost certain that the person who killed Ewell was Mr. Boo Radley.\nThe reconstruction dialogue between Atticus and the sheriff — I reread it multiple times; it\u0026rsquo;s absolutely fascinating. 
It traces the entire process of reasoning through Ewell\u0026rsquo;s death and Atticus\u0026rsquo;s and the sheriff\u0026rsquo;s psychological shifts — yet throughout, the real killer is never once named.\nAtticus wants to clarify the facts; the sheriff wants to protect the child. First, the protagonist says someone pulled Ewell away — she felt it was Jem. Based on this, Atticus deduces that Jem got up, pulled Ewell away, wrestled the knife from him, and killed him. Working from Atticus\u0026rsquo;s deduction that Jem is the killer, the sheriff wants to protect Jem and says, \u0026ldquo;Ewell fell dead on his own knife.\u0026rdquo; Atticus then says: \u0026ldquo;If we cover up the truth, that would go against everything I\u0026rsquo;ve ever taught Jem about how to be a person.\u0026rdquo; To convince Atticus, the sheriff even demonstrates the tripping scenario.\nGradually realizing the truth. A thirteen-year-old boy with a broken arm is unlikely to fight and kill an adult in the dark. \u0026ldquo;Unless someone is very accustomed to the dark to qualify as a witness\u0026hellip;\u0026rdquo; — an unmistakable hint at Radley, who never leaves his house.\nThe key piece of evidence — the kitchen knife. Atticus \u0026ldquo;suddenly\u0026rdquo; asks about the knife. The knife is still in Ewell\u0026rsquo;s body. Both of them individually realize the knife is Radley\u0026rsquo;s. They need to smooth over the knife issue. The sheriff suggests maybe Ewell found it in the dump and sharpened it.\nConfirming the lie. There are many slow-motion descriptions interwoven here. Both Atticus and the sheriff are silently checking whether this lie has any holes, whether they should accept it: Ewell\u0026rsquo;s death was an accident — he killed himself. In the end, they reach an agreement. Even the eight-year-old protagonist says, \u0026ldquo;I can understand.\u0026rdquo;\nThe title drop. \u0026ldquo;This man has done a great service for you and for this entire town. 
If people ignored his reclusive habits and forced him into the spotlight — I think, that would be a crime.\u0026rdquo; This line directly hits the novel\u0026rsquo;s theme. A mockingbird symbolizes an innocent, harmless person. Dragging Radley — this mockingbird — into the spotlight is a crime. It echoes what Atticus said earlier: \u0026ldquo;Remember, it\u0026rsquo;s a sin to kill a mockingbird.\u0026rdquo;\nFinal Thoughts # To Kill a Mockingbird is a relatively simple story, easy to understand — at least compared to some of the books I\u0026rsquo;ve read before. The stories of the mockingbirds in the book left a deep impression. It reminds me of the \u0026ldquo;Brother Long\u0026rdquo; self-defense case from a few years ago in China. Without Brother Long, it\u0026rsquo;s likely that killing someone in self-defense would still get you a prison sentence here. Think about America nearly a hundred years ago, when the law itself was still newly established\u0026hellip; Think about that Black man wrongly convicted, shot dead by prison guards while trying to escape. What kind of despair must he have felt in that prison cell? He just wanted to be a decent person — and his life was suddenly cut short.\nLet me close with a line from Atticus: \u0026ldquo;I think there\u0026rsquo;s just one kind of folks. Folks.\u0026rdquo;\n","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/2024/08/12/book-notes-to-kill-a-mockingbird/","section":"Posts","summary":" Preface # Many people probably know To Kill a Mockingbird. I saw its ratings were sky-high and couldn’t resist picking it up. Sure enough, the story is brilliant — never a dull moment. Its style is quite different from the books I’d read before. Personally, I think it’s perfectly suited for middle school readers (no condescension intended) — a simple, fun, and superbly written story. In truth, the message the whole book wants to convey is very clear: don’t harm innocent people. 
The real difficulty lies in how to build a brilliant story around such a simple idea.\n","title":"Book Notes — To Kill a Mockingbird","type":"posts"},{"content":" Why Write About Two Books Together? # Normally I\u0026rsquo;d write separate pieces after finishing these two books, but I figured neither would yield all that much content. Although I\u0026rsquo;ve read a few English originals (and written about them), I clearly underestimated the difficulty of When Breath Becomes Air. It\u0026rsquo;s packed with unfamiliar vocabulary — loads of medical terms I\u0026rsquo;d never encountered. I basically forced my way through it with half-understanding. As for What Life Should Mean to You\u0026hellip; it doesn\u0026rsquo;t feel as miraculous as people say. After all, it\u0026rsquo;s a century old — I didn\u0026rsquo;t extract much nourishment from it (a little, though). To avoid the awkwardness of too-thin content, I\u0026rsquo;m lumping them together.\nWhen Breath Becomes Air # The author was a surgeon with extraordinary achievements in medicine. At the peak of his career, he learned he had terminal cancer. Less than two years after the diagnosis, he passed away. This book was written during those two years. It describes, from a first-person perspective, how one confronts such misfortune as cancer and reflects on life and its meaning in one\u0026rsquo;s final days.\nWhen he learned he had terminal cancer, as a top surgeon he knew exactly what it meant. He knew he didn\u0026rsquo;t have long to live. At first, he was even angry — why did such a low-probability event happen to me? Why me? Something like this is hard for anyone to accept. But only the truly ill live day by day with the pain, quietly walking toward that inevitable but unscheduled death.\nAfter his diagnosis, he and his wife decided to have a child immediately — before chemotherapy began. The author also managed to spend a few months with his baby daughter before passing. 
He wanted to watch his precious daughter grow up, to know what she\u0026rsquo;d be like when she was older — though he was certain he\u0026rsquo;d never know. It seems almost too cruel.\nNear the end of the book (about twenty or thirty pages from the finish), the author\u0026rsquo;s prose abruptly stops. What follows is a chapter written by his wife, opening with: \u0026ldquo;Paul has left us\u0026hellip;\u0026rdquo; Even knowing how it would end, I couldn\u0026rsquo;t accept it — death came so suddenly that he couldn\u0026rsquo;t even finish his book\u0026hellip; But thinking about it from the book\u0026rsquo;s intended meaning, this incompleteness is, in a way, a kind of completion\u0026hellip;\nHow should we view death? If I were to die before forty, what would I do? I\u0026rsquo;d certainly be unwilling — there are too many things I haven\u0026rsquo;t finished. The author ultimately saw through the meaning of life; he believed the most important thing is to experience life and live in the present moment. I seem to be different — I live in the future, never now! If I go die right now, I\u0026rsquo;d leave this world accompanied by anger and resentment.\n(His experience inevitably reminds me of the Japanese drama The White Tower — an absolutely brilliant show! Professor Zaizen, at the peak of his career, gets cancer and ultimately donates his body for cancer pathology research\u0026hellip;)\nWhat Life Should Mean to You (Beyond Inferiority) # A famous work in psychology by Alfred Adler, founder of individual psychology. Long ago, I watched an episode of Lao Gao and Xiao Mo about Adler and individual psychology — they made it sound almost miraculous. I couldn\u0026rsquo;t resist reading it, and figured I might even analyze myself a bit.\nThe most important idea in individual psychology is: how we perceive traumatic experiences is the essence of psychological problems — not the experiences themselves causing them. 
But this doesn\u0026rsquo;t deny the influence of the \u0026ldquo;past\u0026rdquo; on people\u0026rsquo;s behavior.\nHowever, I personally found the book somewhat boring\u0026hellip; \u0026ldquo;a bit too humanities-oriented.\u0026rdquo; The essential differences between short chapters aren\u0026rsquo;t that significant — it\u0026rsquo;s just discussing individual psychology through different topics. I genuinely couldn\u0026rsquo;t extract substantial nourishment from it. Maybe because it was written a hundred years ago, or maybe I\u0026rsquo;m just not cut out for this.\nAdler also proposed: a group (or a couple) should think and act for the benefit of the collective, or else problems of separation will arise. If one person harbors self-serving thoughts, the group is bound to be unstable. I couldn\u0026rsquo;t agree more. I could write some self-analysis here, but I don\u0026rsquo;t want to expose myself — which is also why I felt this book note wouldn\u0026rsquo;t be very substantial.\nBefore reading this book, I also sampled The Courage to Be Disliked and How to Win Friends and Influence People. Since both had higher ratings than the \u0026ldquo;founding father\u0026rdquo; Adler\u0026rsquo;s book, I checked them out to see what they were about — and I didn\u0026rsquo;t like either. Courage is just a dialogue between two people — the classic wise-man-and-scholar format — where you learn the book\u0026rsquo;s ideas through conversation\u0026hellip; I gave up after a bit. Bestseller style. How to Win Friends was more tolerable — it directly lays out life advice in plain terms. I read about ten pieces of advice — somewhat valuable — but I still couldn\u0026rsquo;t finish it. Bestseller style too.\nClosing # I\u0026rsquo;d read a few English originals and clearly got a bit cocky — turns out I need to be realistic. Gauge the difficulty first before diving in. I\u0026rsquo;d been wanting to read psychology for a while. 
After reading it, I\u0026rsquo;ve learned I\u0026rsquo;m not cut out for it. Well, no matter what, I had to write this book note — recording my life, like Paul did.\n","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/2024/08/12/book-notes-when-breath-becomes-air-what-life-should-mean-to-you/","section":"Posts","summary":"Why Write About Two Books Together? # Normally I’d write separate pieces after finishing these two books, but I figured neither would yield all that much content. Although I’ve read a few English originals (and written about them), I clearly underestimated the difficulty of When Breath Becomes Air. It’s packed with unfamiliar vocabulary — loads of medical terms I’d never encountered. I basically forced my way through it with half-understanding. As for What Life Should Mean to You… it doesn’t feel as miraculous as people say. After all, it’s a century old — I didn’t extract much nourishment from it (a little, though). To avoid the awkwardness of too-thin content, I’m lumping them together.\n","title":"Book Notes — When Breath Becomes Air \u0026 What Life Should Mean to You","type":"posts"},{"content":" Preface # I came across this book because I saw it on an Obama-recommended reading list in an e-book app. This particular title felt special, and the ratings were good, so I decided to check it out. At first, reading the synopsis — a memoir by a hiking enthusiast — I assumed the book would just describe scenic views and the hardships of sleeping rough, probably not very \u0026ldquo;exciting.\u0026rdquo; But its writing has a distinctiveness all its own; it never feels boring. Once you start a short chapter, you simply can\u0026rsquo;t stop. By the end, when I saw only 10% of the pages remained, I actually felt a sense of imminent parting — a reluctance to say goodbye. 
This feeling of having discovered a treasure accompanied me throughout the entire reading.\nI used to be a devotee of physical books — I liked the sense of weight and substance, and the satisfaction of finishing a paper volume. Later, as I gradually came to embrace e-books, I discovered one advantage e-books have over physical ones: links. I found this book because a book mentioned in Space Odyssey (I think) led me, via links, to \u0026ldquo;Obama\u0026rsquo;s recommendations,\u0026rdquo; and from all the Obama-recommended books I picked a few that interested me — one was Wild. The protagonist of Wild also loves to read, and she mentions several books; I bookmarked about five or six of them. So my originally barren reading list grew and flourished through this chain of links. These books are far, far better than those \u0026ldquo;Top Book Rankings\u0026rdquo; or \u0026ldquo;Essential Classics, Domestic and International.\u0026rdquo; Finishing a physical book easily leaves you wondering what to read next; e-books don\u0026rsquo;t have that problem.\nThe Queen\u0026rsquo;s Journey # After losing her mother, seeing her family fall apart, facing massive college debt, and descending into drug addiction, the author — perhaps wanting to rediscover herself — made \u0026ldquo;thorough\u0026rdquo; preparations and set out for the Pacific Crest Trail. For someone with zero hiking experience, the Pacific Crest Trail is the highest difficulty level. She called her overloaded backpack \u0026ldquo;the Monster\u0026rdquo; — so heavy she couldn\u0026rsquo;t even put it on properly. And just like that, this outdoor novice set off. Completing the entire trail takes four to six months. Along the way, you need to plan resupply points in advance, mailing food and essential supplies ahead to those locations. Once you reach a resupply point and restock, you return to the trail and press onward. 
The suffering on the journey — though hard to feel vicariously — you can sense how severe it was. The author alone had six toenails removed. This kind of agony, along with various unexpected incidents, is beyond what the average person can endure. That\u0026rsquo;s why the \u0026ldquo;failure rate\u0026rdquo; for people attempting this trail is very high. You need an exceptionally robust physique, a thorough plan, and some luck.\nEven with the dangers of wild animals, venomous snakes, scorching sun, glaciers, dehydration, and injuries in the wilderness, none compare to the danger of \u0026ldquo;people\u0026rdquo; — especially for a solo woman in her twenties. Once you experience the potential threat posed by humans, nature\u0026rsquo;s objective dangers almost feel like a relief. This reminds me of the plot of the HBO series The Last of Us, which I watched recently: in a post-apocalyptic world, encountering zombies isn\u0026rsquo;t the scariest thing — encountering humans is.\nThe author seems somewhat lascivious (at least by sexually conservative standards) — or maybe all Americans are this open about it. Before hitting the trail, she\u0026rsquo;d have one-night stands with many men, thoroughly enjoying the feeling, unapologetically describing this physical need and the sense of conquest in capturing a man.\nWhat I craved wasn\u0026rsquo;t someone to love, but just someone to press their body against mine.\nOn the trail, she also fantasized about attractive men she encountered and secretly watched men undress. She also packed a lot of condoms in her backpack — sadly, not a single one was used by the end. Of course, I don\u0026rsquo;t mean she was only uninhibited about physical desires; she was just as emotionally and sentimentally passionate. No right or wrong — she simply expressed exactly how she felt in the moment. 
I really admire this kind of honest writing.\nThe Pacific Crest Trail # The Pacific Crest Trail is one of the world\u0026rsquo;s famous long-distance trails, located in the mountain ranges of the western United States — a range jokingly called \u0026ldquo;America\u0026rsquo;s Dragon Vein\u0026rdquo;\u0026hellip; Trail information is very easy to find and extremely well-documented. The author also relied on trail guidebooks for preparation and handling unexpected situations. The trail stretches 4,000 kilometers, spanning the contiguous United States from the Canadian border to the Mexican border, passing through Washington, Oregon, and California. It is one of the National Scenic Trails.\nI have essentially zero contact with hiking — my concept of it is still nil — so I can only be an armchair traveler envying these backpackers. After a bit of searching on outdoor hiking, I found there\u0026rsquo;s a tremendous amount to learn. Outdoor hiking not only offers spectacular scenery but apparently even has therapeutic functions — I easily found hiking psychotherapy associations just by searching. Let me quote a passage from the original about the spiritual world on the trail:\nNow, I was wholly immersed in this world, living in a completely new way. Living so rootlessly, without even a roof over my head for shelter from wind and rain, made the world both much larger and much smaller.\nThe Golden Touches # While reading this book, I kept thinking of Educated: A Memoir. Both are memoirs describing a period of the authors\u0026rsquo; pasts. Not only are their writing styles similar, but their upbringings are too — growing up in isolated mountains, having an abusive father, a backward family life, and unexpectedly being extremely good at studying and getting into a top university.\nMore importantly, their writing style easily captures the reader\u0026rsquo;s emotions without ever feeling boring or stifling. 
I haven\u0026rsquo;t managed to fully summarize how they write, but one thing I paid special attention to: the accumulation of emotion followed by the unexpected move.\nFor example: when the author scatters her mother\u0026rsquo;s ashes into the earth, she keeps a few larger fragments of bone, unable to let go. Finally, she puts these unburned bone fragments into her mouth and swallows them.\nI was stunned reading this. Throughout the book, she describes her feelings for her mother in many places. Her mother\u0026rsquo;s death affected her profoundly. After flatly (or perhaps despairingly) describing her mother\u0026rsquo;s death and cremation, unable to let go, she chooses to swallow her mother\u0026rsquo;s bones into her stomach — so she can become one with her mother! What kind of emotion could drive such an act — one that most people would find impossible to accept — as a vessel for such heavy feeling? This swallowing motion conveys far more powerfully than endlessly repeated expressions of longing ever could, and it grabs the reader\u0026rsquo;s attention far more effectively.\nThere\u0026rsquo;s also a passage about condoms. An older backpacker, seeing how much stuff she\u0026rsquo;s carrying, helps her sort through her pack, throwing out things that are completely useless. The old backpacker finds a big packet of condoms: \u0026ldquo;Are you sure you need these?\u0026rdquo; Having gained some trail experience, she knows the stuff is utterly useless — but as she throws out the big pack, she secretly keeps one~ Then, the next morning when she wakes up, that one condom is gone\u0026hellip;\nThese plot points are so dramatized I almost suspected they were fabricated. But I carefully read the author\u0026rsquo;s preface — she says she merely omitted certain scenes and guarantees that the events are all true.\nRegardless, a touch of plot that slightly exceeds realistic logic is essential in writing — it grabs the reader\u0026rsquo;s heart. 
The authenticity of these \u0026ldquo;golden touches\u0026rdquo; themselves isn\u0026rsquo;t important; what matters is having that touch. Let me give an example from one of my favorite films, Memories of Murder, which I\u0026rsquo;m sure many have seen. Years later, the old detective returns to the crime scene and meets a child. The child says someone else was just here, crouching and staring at this drainage ditch just like you. The old detective immediately realizes this person could be the murderer. He asks the child what the person looked like. The child says: \u0026ldquo;Just\u0026hellip; ordinary.\u0026rdquo; This moment is a stroke of genius. Many viewers obsess over who the killer actually is, but it doesn\u0026rsquo;t matter who it is. \u0026ldquo;The murderer is ordinary\u0026rdquo; — that\u0026rsquo;s what the film is trying to say.\nFinal Thoughts # What could have been a boring story was written into a captivating work, with genuine depth, rich and authentic emotion. It\u0026rsquo;s a memoir of following the author back to nature and rediscovering the self — absolutely worth reading!\nRecently, good books have been streaming in nonstop; my bookshelf is quite packed. But I\u0026rsquo;m not worried about them gathering dust at all, because I believe the quality of these books matches this one — reaching that \u0026ldquo;can\u0026rsquo;t-put-it-down\u0026rdquo; level, requiring no self-discipline to become completely immersed. Let me quote from a hiking expert\u0026rsquo;s blog:\n\u0026gt; What was your favorite stretch of scenery? \u0026gt; The next one.\n","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/2024/08/12/book-notes-wild-from-lost-to-found-on-the-pacific-crest-trail/","section":"Posts","summary":" Preface # I came across this book because I saw it on an Obama-recommended reading list in an e-book app. This particular title felt special, and the ratings were good, so I decided to check it out. 
At first, reading the synopsis — a memoir by a hiking enthusiast — I assumed the book would just describe scenic views and the hardships of sleeping rough, probably not very “exciting.” But its writing has a distinctiveness all its own; it never feels boring. Once you start a short chapter, you simply can’t stop. By the end, when I saw only 10% of the pages remained, I actually felt a sense of imminent parting — a reluctance to say goodbye. This feeling of having discovered a treasure accompanied me throughout the entire reading.\n","title":"Book Notes — Wild: From Lost to Found on the Pacific Crest Trail","type":"posts"},{"content":"The business team reported that INSERT VALUES occasionally became slow. By the time I checked the active sessions, the slow write problem had already subsided.\nLater, I discovered that the slow write problem lasted less than half a minute, with INSERT VALUES taking 1-2 seconds. I wrote a script to capture active session information and managed to get the session data:\nwait_event | count ---------------------+------- [null] | 11 WALRead | 1 DataFileRead | 2 BgWriterMain | 1 WALWrite | 40 AutoVacuumMain | 1 ClientRead | 385 LogicalLauncherMain | 1 The most abnormal wait event was WALWrite with 40 sessions.\nTwo of the WALWrite-waiting sessions looked like this:\npid | usename | xact_start | state_change | wait_event | wait_event_type | state | partofquery -------+----------+-------------------------------+-------------------------------+---------------+-----------------+--------+-------------------------------------------------------------- 144955 | lzluser11 | 2024-05-23 07:58:27.516574+08 | 2024-05-23 07:58:27.516588+08 | WALWrite | LWLock | active | insert into table1( 179869 | lzluser11 | 2024-05-23 07:58:28.116371+08 | 2024-05-23 07:58:28.116386+08 | WALWrite | IO | active | insert into table1( Let\u0026rsquo;s search the source code for WALWrite-related content:\n* WALWriteLock: must be held to write WAL buffers to disk 
(XLogWrite or * XLogFlush). /* * LWLockAcquireOrWait - Acquire lock, or wait until it\u0026#39;s free * * The semantics of this function are a bit funky. If the lock is currently * free, it is acquired in the given mode, and the function returns true. If * the lock isn\u0026#39;t immediately free, the function waits until it is released * and returns false, but does not acquire the lock. * * This is currently used for WALWriteLock: when a backend flushes the WAL, * holding WALWriteLock, it can flush the commit records of many other * backends as a side-effect. Those other backends need to wait until the * flush finishes, but don\u0026#39;t need to acquire the lock anymore. They can just * wake up, observe that their records have already been flushed, and return. */ When WAL is written from WAL buffers to disk, the WALWriteLock must be held.\nWhen a backend flushes WAL while holding WALWriteLock, it can also flush the commit records of other backends. Those other backends need to wait for this flush to finish, but they don\u0026rsquo;t need to acquire the lock afterward. If their WAL has been flushed, they can return directly (rather than flushing WAL again).\nXLogFlush is extremely important. The key code in XLogFlush is in the for loop:\n/* * Ensure that all XLOG data through the given position is flushed to disk. * * NOTE: this differs from XLogWrite mainly in that the WALWriteLock is not * already held, and we try to avoid acquiring it if possible. */ void XLogFlush(XLogRecPtr record) { ... /* * Now wait until we get the write lock, or someone else does the flush * for us. */ for (;;) { XLogRecPtr\tinsertpos; /* read LogwrtResult and update local state */ SpinLockAcquire(\u0026amp;XLogCtl-\u0026gt;info_lck); if (WriteRqstPtr \u0026lt; XLogCtl-\u0026gt;LogwrtRqst.Write) WriteRqstPtr = XLogCtl-\u0026gt;LogwrtRqst.Write; LogwrtResult = XLogCtl-\u0026gt;LogwrtResult; SpinLockRelease(\u0026amp;XLogCtl-\u0026gt;info_lck); /* done already? 
*/ if (record \u0026lt;= LogwrtResult.Flush) break; /* * Before actually performing the write, wait for all in-flight * insertions to the pages we\u0026#39;re about to write to finish. */ insertpos = WaitXLogInsertionsToFinish(WriteRqstPtr); /* * Try to get the write lock. If we can\u0026#39;t get it immediately, wait * until it\u0026#39;s released, and recheck if we still need to do the flush * or if the backend that held the lock did it for us already. This * helps to maintain a good rate of group committing when the system * is bottlenecked by the speed of fsyncing. */ if (!LWLockAcquireOrWait(WALWriteLock, LW_EXCLUSIVE)) { /* * The lock is now free, but we didn\u0026#39;t acquire it yet. Before we * do, loop back to check if someone else flushed the record for * us already. */ continue; } /* Got the lock; recheck whether request is satisfied */ LogwrtResult = XLogCtl-\u0026gt;LogwrtResult; if (record \u0026lt;= LogwrtResult.Flush) { LWLockRelease(WALWriteLock); break; } /* * Sleep before flush! By adding a delay here, we may give further * backends the opportunity to join the backlog of group commit * followers; this can significantly improve transaction throughput, * at the risk of increasing transaction latency. * * We do not sleep if enableFsync is not turned on, nor if there are * fewer than CommitSiblings other backends with active transactions. */ if (CommitDelay \u0026gt; 0 \u0026amp;\u0026amp; enableFsync \u0026amp;\u0026amp; MinimumActiveBackends(CommitSiblings)) { pg_usleep(CommitDelay); /* * Re-check how far we can now flush the WAL. It\u0026#39;s generally not * safe to call WaitXLogInsertionsToFinish while holding * WALWriteLock, because an in-progress insertion might need to * also grab WALWriteLock to make progress. But we know that all * the insertions up to insertpos have already finished, because * that\u0026#39;s what the earlier WaitXLogInsertionsToFinish() returned. 
* We\u0026#39;re only calling it again to allow insertpos to be moved * further forward, not to actually wait for anyone. */ insertpos = WaitXLogInsertionsToFinish(insertpos); } /* try to write/flush later additions to XLOG as well */ WriteRqst.Write = insertpos; WriteRqst.Flush = insertpos; XLogWrite(WriteRqst, false); LWLockRelease(WALWriteLock); /* done */ break; } ... } XLogFlush is the main function for flushing WAL to disk. Its logic is:\n(1) Check whether the WAL up to the requested position has already been flushed by someone else. If so, return directly. (2) Try to acquire WALWriteLock in exclusive mode. If the lock is busy, wait until it is released, then go back to step 1 instead of taking the lock right away. (3) Once the lock is acquired, check again whether the requested WAL has already been flushed. If so, release WALWriteLock and return, since another backend may have done the flush while we were waiting. (4) If commit_delay is greater than 0, fsync is enabled, and at least commit_siblings other backends have active transactions, sleep for commit_delay microseconds and then advance the flush position so that more commits can join the same group flush. This step doesn\u0026rsquo;t apply here because commit_delay defaults to 0, so no deliberate delay is added. (5) Call XLogWrite to write and flush the WAL, then release WALWriteLock.\nIn short, XLogFlush first checks whether the requested WAL has already been flushed. If not, the backend that gets WALWriteLock holds it until XLogWrite finishes. 
XLogWrite is the function that performs the actual write: it works out which WAL pages go to which position and writes them out.\nReturning to the wait events from the active sessions, the IO:WALWrite wait is relatively easy to understand, but how do we confirm whether LWLock:WALWrite is a problem?\nFrom the XLogFlush logic, we know that WALWriteLock is an exclusive LWLock that PostgreSQL acquires when writing WAL to disk. This makes sense: WAL is written sequentially, so it can only be written in exclusive mode; you can\u0026rsquo;t let whoever writes fastest write first, as that could easily corrupt data. In other words, WAL flushes are serialized.\nWith this logic in mind, looking back at pg_stat_activity, we can see that there was only 1 IO:WALWrite wait, while there were dozens of LWLock:WALWrite waits.\nAlthough we can\u0026rsquo;t directly see the LWLock blocking chain, we can infer from the source code that the LWLock:WALWrite sessions are waiting on the one session in IO:WALWrite.\nThe official documentation has a section about XLogFlush and adjusting WAL buffers:\nNormally, WAL buffers should be written and flushed by an XLogFlush request, which is made, for the most part, at transaction commit time to ensure that transaction records are flushed to permanent storage. On systems with high WAL output, XLogFlush requests might not occur often enough to prevent XLogInsertRecord from having to do writes. On such systems one should increase the number of WAL buffers by modifying the wal_buffers parameter. When full_page_writes is set and the system is very busy, setting wal_buffers higher will help smooth response times during the period immediately following each checkpoint.\nUnder normal circumstances, WAL buffers are flushed by XLogFlush, for example at transaction commit, to write WAL to disk. 
If the WAL log volume is large but XLogFlush is not triggered frequently enough (meaning mostly large transactions), XLogInsertRecord needs to write uncommitted WAL records — meaning the WAL buffer is insufficient. In this case, increasing wal_buffers may slightly help with system response time.\nThere are two commonly used internal WAL functions: XLogInsertRecord and XLogFlush. XLogInsertRecord is used to place a new record into the WAL buffers in shared memory. If there is no space for the new record, XLogInsertRecord will have to write (move to kernel cache) a few filled WAL buffers\nCombined with a description from the XLogInsertRecord function:\n* We have now done all the preparatory work we can without holding a * lock or modifying shared state. From here on, inserting the new WAL * record to the shared WAL buffer cache is a two-step process: * * 1. Reserve the right amount of space from the WAL. The current head of *\treserved space is kept in Insert-\u0026gt;CurrBytePos, and is protected by *\tinsertpos_lck. * * 2. Copy the record to the reserved WAL space. This involves finding the *\tcorrect WAL buffer containing the reserved space, and copying the *\trecord in place. This can be done concurrently in multiple processes. The XLogInsertRecord function is used to place new WAL records into the WAL buffer:\nWriting requires reserving a certain amount of space. Copy the WAL record to the reserved WAL space (presumably the reserved space in the WAL buffer). Multiple processes can execute this in parallel. Copying WAL records to the WAL buffer can be done in parallel. This is unlikely to be a bottleneck since it\u0026rsquo;s an in-memory copy with parallelism.\nBut XLogFlush is different — it holds an exclusive LWLock throughout the write. So, in scenarios with high concurrency and small transactions, increasing WAL buffers theoretically won\u0026rsquo;t be very effective.\nAt this point, we can rule out wal_buffers memory tuning and focus our attention on I/O. 
Looking at the I/O-related wait counts in pg_stat_activity:\nDataFileRead\t4 DataFileExtend\t1 WALWrite\t1 WALRead\t1 The INSERT VALUES slowness lasted less than a minute and was not normally present. However, looking at the normal session information, I/O class WALWrite waits were almost always there:\npid | usename | xact_start | state_change | wait_event | wait_event_type | state | partofquery -------+----------+-------------------------------+-------------------------------+---------------+-----------------+--------+-------------------------------------------------------------- 72668 | lzluser11 | 2024-05-23 09:32:20.828394+08 | 2024-05-23 09:32:20.82841+08 | WALWrite | IO | active | insert into table1( + 77215 | lzluser11 | 2024-05-23 09:33:01.342541+08 | 2024-05-23 09:33:01.342552+08 | WALWrite | IO | active | insert into table1 + 94904 | lzluser11 | 2024-05-23 09:34:32.442309+08 | 2024-05-23 09:34:32.442323+08 | WALWrite | IO | active | insert into table1 + 88024 | lzluser11 | 2024-05-23 09:36:28.779086+08 | 2024-05-23 09:36:28.779311+08 | WALWrite | IO | active | UPDATE table2 SET + 103236 | lzluser11 | 2024-05-23 09:37:04.144283+08 | 2024-05-23 09:37:04.144302+08 | WALWrite | IO | active | insert into table1 + 47342 | lzluser11 | 2024-05-23 09:37:09.192683+08 | 2024-05-23 09:37:09.192699+08 | WALWrite | IO | active | insert into table1 + 75399 | lzluser11 | 2024-05-23 09:45:30.743023+08 | 2024-05-23 09:45:30.743024+08 | WALWrite | IO | active | update table1 + 221993 | lzluser11 | 2024-05-23 09:46:16.184532+08 | 2024-05-23 09:46:16.184541+08 | WALWrite | IO | active | insert into table1 However, checking the I/O performance at that time, writing 15 MB/s was not high — in fact, it was relatively low compared to other time periods, and w_await was also very low:\nDevice: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util dm-322 0.00 0.00 187.00 1515.00 3572.00 15344.00 22.23 2.05 1.20 9.39 0.18 0.15 25.70 There was no 
strong evidence pointing to a storage performance issue.\nAt present, it appears to be transient lock contention during concurrent INSERT VALUES small transactions when flushing WAL. We can rule out the following options:\nConcurrent small transactions — no need to adjust WAL buffers WAL log volume is not large — no need to enable log compression Not many FPIs (Full Page Images) — no need to adjust checkpoint I/O pressure is not high — no need to improve I/O performance At minimum, the following optimizations can be made:\nEnable database group commit (can be deferred if concerned about risk; testing required) Batch multiple INSERT VALUES statements at the application level to reduce WALWriteLock contention ","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/2024/08/12/case-study-analyzing-occasional-slow-insert-values/","section":"Posts","summary":"The business team reported that INSERT VALUES occasionally became slow. By the time I checked the active sessions, the slow write problem had already subsided.\nLater, I discovered that the slow write problem lasted less than half a minute, with INSERT VALUES taking 1-2 seconds. I wrote a script to capture active session information and managed to get the session data:\nwait_event | count ---------------------+------- [null] | 11 WALRead | 1 DataFileRead | 2 BgWriterMain | 1 WALWrite | 40 AutoVacuumMain | 1 ClientRead | 385 LogicalLauncherMain | 1 The most abnormal wait event was WALWrite with 40 sessions.\n","title":"Case Study: Analyzing Occasional Slow INSERT VALUES","type":"posts"},{"content":" Problem Symptoms # The backup process (pg_start_backup()) was blocked by the checkpointer, and the checkpointer was blocked by the logical replication walsender. 
The database was still serving queries, but backup, checkpoint, and logical replication were all completely hung.\nTwo processes in pg_stat_activity showed an unusual wait event: replication_slot_io.\n[postgres@hostlzl:6666/postgres][04-08.16:50:28]=\u0026gt; select * from pg_stat_activity where pid=173038 \\gx -[ RECORD 1 ]----+------------------------------ datid | 17630 datname | lzldb pid | 173038 usesysid | 35157 usename | repuser application_name | PostgreSQL JDBC Driver client_addr | 30.88.75.58 client_hostname | [null] client_port | 37623 backend_start | 2024-04-02 11:40:07.75022+08 xact_start | [null] query_start | [null] state_change | 2024-04-02 11:40:07.764475+08 wait_event_type | LWLock wait_event | replication_slot_io state | active backend_xid | [null] backend_xmin | [null] query | backend_type | walsender Time: 6.658 ms [postgres@hostlzl:6666/postgres][04-08.16:50:34]=\u0026gt; select * from pg_stat_activity where pid=12729\\gx -[ RECORD 1 ]----+------------------------------ datid | [null] datname | [null] pid | 12729 usesysid | [null] usename | [null] application_name | client_addr | [null] client_hostname | [null] client_port | [null] backend_start | 2024-04-02 11:23:03.343116+08 xact_start | [null] query_start | [null] state_change | [null] wait_event_type | LWLock wait_event | replication_slot_io state | [null] backend_xid | [null] backend_xmin | [null] query | backend_type | checkpointer One walsender and one checkpointer. Both were started on April 2. 
Let\u0026rsquo;s check the walsender 173038 logs:\n--repuser log 2024-04-02 11:40:07.750 CST,,,173038,\u0026#34;30.88.75.58:37623\u0026#34;,660b7e17.2a3ee,1,\u0026#34;\u0026#34;,2024-04-02 11:40:07 CST,,0,LOG,00000,\u0026#34;connection received: host=30.88.75.58 port=37623\u0026#34;,,,,,,,,,\u0026#34;\u0026#34; 2024-04-02 11:40:07.756 CST,\u0026#34;repuser\u0026#34;,\u0026#34;lzldb\u0026#34;,173038,\u0026#34;30.88.75.58:37623\u0026#34;,660b7e17.2a3ee,2,\u0026#34;authentication\u0026#34;,2024-04-02 11:40:07 CST,32/30,0,LOG,00000,\u0026#34;replication connection authorized: user=repuser\u0026#34;,,,,,,,,,\u0026#34;\u0026#34; 2024-04-02 11:40:07.765 CST,\u0026#34;repuser\u0026#34;,\u0026#34;lzldb\u0026#34;,173038,\u0026#34;30.88.75.58:37623\u0026#34;,660b7e17.2a3ee,3,\u0026#34;idle\u0026#34;,2024-04-02 11:40:07 CST,32/0,0,LOG,00000,\u0026#34;starting logical decoding for slot \u0026#34;\u0026#34;pg_lzldb_lzldb_ora_pgdb_pgdb\u0026#34;\u0026#34;\u0026#34;,\u0026#34;Streaming transactions committing after 4263/42E6EF88, reading WAL from 4263/41DAEBD0.\u0026#34;,,,,,,,,\u0026#34;PostgreSQL JDBC Driver\u0026#34; 2024-04-02 11:40:07.765 CST,\u0026#34;repuser\u0026#34;,\u0026#34;lzldb\u0026#34;,173038,\u0026#34;30.88.75.58:37623\u0026#34;,660b7e17.2a3ee,4,\u0026#34;idle\u0026#34;,2024-04-02 11:40:07 CST,32/0,0,LOG,00000,\u0026#34;logical decoding found consistent point at 4263/41DAEBD0\u0026#34;,\u0026#34;There are no running transactions.\u0026#34;,,,,,,,,\u0026#34;PostgreSQL JDBC Driver\u0026#34; Walsender 173038 only shows startup information. 
After that, no more log output — it likely hung from the very start.\nScrolling back a bit, we can find an earlier walsender for the same replication slot (different PID, same slot name):\n--84918 earlier startup logs 2024-04-02 11:30:07.498 CST,,,84918,\u0026#34;30.88.75.58:54898\u0026#34;,660b7bbf.14bb6,1,\u0026#34;\u0026#34;,2024-04-02 11:30:07 CST,,0,LOG,00000,\u0026#34;connection received: host=30.88.75.58 port=54898\u0026#34;,,,,,,,,,\u0026#34;\u0026#34; 2024-04-02 11:30:07.504 CST,\u0026#34;repuser\u0026#34;,\u0026#34;lzldb\u0026#34;,84918,\u0026#34;30.88.75.58:54898\u0026#34;,660b7bbf.14bb6,2,\u0026#34;authentication\u0026#34;,2024-04-02 11:30:07 CST,30/3,0,LOG,00000,\u0026#34;replication connection authorized: user=repuser\u0026#34;,,,,,,,,,\u0026#34;\u0026#34; 2024-04-02 11:30:07.514 CST,\u0026#34;repuser\u0026#34;,\u0026#34;lzldb\u0026#34;,84918,\u0026#34;30.88.75.58:54898\u0026#34;,660b7bbf.14bb6,3,\u0026#34;idle\u0026#34;,2024-04-02 11:30:07 CST,30/0,0,LOG,00000,\u0026#34;starting logical decoding for slot \u0026#34;\u0026#34;pg_lzldb_lzldb_ora_pgdb_pgdb\u0026#34;\u0026#34;\u0026#34;,\u0026#34;Streaming transactions committing after 4263/41DADE38, reading WAL from 4263/358F1340.\u0026#34;,,,,,,,,\u0026#34;PostgreSQL JDBC Driver\u0026#34; 2024-04-02 11:30:07.516 CST,\u0026#34;repuser\u0026#34;,\u0026#34;lzldb\u0026#34;,84918,\u0026#34;30.88.75.58:54898\u0026#34;,660b7bbf.14bb6,4,\u0026#34;idle\u0026#34;,2024-04-02 11:30:07 CST,30/0,0,LOG,00000,\u0026#34;logical decoding found consistent point at 4263/358F1340\u0026#34;,\u0026#34;There are no running transactions.\u0026#34;,,,,,,,,\u0026#34;PostgreSQL JDBC Driver\u0026#34; 2024-04-02 11:36:07.061 CST,\u0026#34;repuser\u0026#34;,\u0026#34;lzldb\u0026#34;,86630,\u0026#34;30.88.75.58:45227\u0026#34;,660b7bca.15266,5,\u0026#34;idle\u0026#34;,2024-04-02 11:30:18 CST,30/0,0,ERROR,XX000,\u0026#34;could not write to file \u0026#34;\u0026#34;pg_replslot/pg_lzldb_lzldb_ora_pgdb_pgdb/state.tmp\u0026#34;\u0026#34;: 
Cannot allocate memory\u0026#34;,,,,,,,,,\u0026#34;PostgreSQL JDBC Driver\u0026#34; 2024-04-02 11:36:40.151 CST,\u0026#34;repuser\u0026#34;,\u0026#34;lzldb\u0026#34;,86630,\u0026#34;30.88.75.58:45227\u0026#34;,660b7bca.15266,6,\u0026#34;idle\u0026#34;,2024-04-02 11:30:18 CST,,0,LOG,00000,\u0026#34;disconnection: session time: 0:06:21.760 user=repuser database=lzldb host=30.88.75.58 port=45227\u0026#34;,,,,,,,,,\u0026#34;PostgreSQL JDBC Driver\u0026#34; This replication slot was also started at 11:30:07. Six minutes later, it failed to write state.tmp due to memory exhaustion.\nThe checkpointer process 12729 also reported the same state.tmp error — \u0026quot;pg_replslot/pg_lzldb_lzldb_ora_pgdb_pgdb/state.tmp\u0026quot;\u0026quot;: File exists\u0026quot;. This error appeared ~30 seconds after the replication slot error:\n--checkpoint log 2024-04-02 11:36:39.925 CST,,,12729,,660b7a17.31b9,4,,2024-04-02 11:23:03 CST,,0,LOG,58P02,\u0026#34;could not create file \u0026#34;\u0026#34;pg_replslot/pg_lzldb_lzldb_ora_pgdb_pgdb/state.tmp\u0026#34;\u0026#34;: File exists\u0026#34;,,,,,,,,,\u0026#34;\u0026#34; 2024-04-02 11:36:40.151 CST,,,12729,,660b7a17.31b9,5,,2024-04-02 11:23:03 CST,,0,LOG,00000,\u0026#34;checkpoint complete: wrote 334 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.108 s, sync=0.082 s, total=217.083 s; sync files=139, longest=0.004 s, average=0.000 s; distance=2295 kB, estimate=2295 kB\u0026#34;,,,,,,,,,\u0026#34;\u0026#34; 2024-04-02 11:48:03.414 CST,,,12729,,660b7a17.31b9,6,,2024-04-02 11:23:03 CST,,0,LOG,00000,\u0026#34;checkpoint starting: time\u0026#34;,,,,,,,,,\u0026#34;\u0026#34; After this, the checkpointer produced no more log output — it hung, just like the walsender.\nSearching for pg_replslot/pg_lzldb_lzldb_ora_pgdb_pgdb/state.tmp\u0026quot;\u0026quot;: File exists\u0026quot; quickly leads to a community thread: https://www.postgresql.org/message-id/14b3454f-2d68-c637-68e4-2b42ff976168%40postgrespro.ru\nThe actual fix landed 
in PG 12.3:\nEnsure that a replication slot\u0026rsquo;s io_in_progress_lock is released in failure code paths (Pavan Deolasee) This could result in a walsender later becoming stuck waiting for the lock.\nDeep Dive # We found the bug, but several questions remain:\nWhy did the walsender and checkpointer hang? Who is blocking whom — the walsender or the checkpointer? How was this triggered? What are the solutions? Source Code Analysis # Current version: 11.5.\nPstack of both processes:\n[postgres@hostlzl:lzldb:6666: /pg/pg6666/data/pg_log]$ pstack 173038 ##walsender #0 0x00002b9eec171a0b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0 #1 0x00002b9eec171a9f in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0 #2 0x00002b9eec171b3b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0 #3 0x00000000006b2512 in PGSemaphoreLock (sema=0x2b9ef5fdb0b8) at pg_sema.c:316 #4 0x000000000071e94c in LWLockAcquire (lock=lock@entry=0x2babd8cee5b8, mode=mode@entry=LW_EXCLUSIVE) at lwlock.c:1243 #5 0x00000000006ef7cb in SaveSlotToPath (slot=0x2babd8cee500, dir=dir@entry=0x7ffcaffd79f0 \u0026#34;pg_replslot/pg_lzldb_lzldb_ora_pgdb_pgdb\u0026#34;, elevel=elevel@entry=20) at slot.c:1249 #6 0x00000000006f0515 in ReplicationSlotSave () at slot.c:653 #7 0x00000000006d75d8 in LogicalConfirmReceivedLocation (lsn=\u0026lt;optimized out\u0026gt;) at logical.c:1049 #8 0x00000000006d774d in LogicalIncreaseXminForSlot (current_lsn=current_lsn@entry=72994075200640, xmin=xmin@entry=1241611955) at logical.c:914 #9 0x00000000006e0fb3 in SnapBuildProcessRunningXacts (builder=builder@entry=0x23146c0, lsn=72994075200640, running=running@entry=0x22e8978) at snapbuild.c:1146 #10 0x00000000006d484c in DecodeStandbyOp (buf=0x7ffcaffd7eb0, buf=0x7ffcaffd7eb0, ctx=0x2209820) at decode.c:318 #11 LogicalDecodingProcessRecord (ctx=0x2209820, record=\u0026lt;optimized out\u0026gt;) at decode.c:121 #12 0x00000000006e50e0 in XLogSendLogical () at walsender.c:2799 #13 0x00000000006e7122 in 
WalSndLoop (send_data=send_data@entry=0x6e5080 \u0026lt;XLogSendLogical\u0026gt;) at walsender.c:2162 #14 0x00000000006e7d91 in StartLogicalReplication (cmd=0x22eedd8) at walsender.c:1109 #15 exec_replication_command (cmd_string=cmd_string@entry=0x2210c48 \u0026#34;START_REPLICATION SLOT pg_lzldb_lzldb_ora_pgdb_pgdb LOGICAL 4263/42E6EF88 (\\\u0026#34;add-tables\\\u0026#34; \u0026#39;public.acr_finance_coa_partition_17_01,public.acr_finance_coa_partition_17_02,public.acr_finance_coa_part\u0026#34;...) at walsender.c:1541 #16 0x000000000072c899 in PostgresMain (argc=\u0026lt;optimized out\u0026gt;, argv=argv@entry=0x2216f78, dbname=0x2216c98 \u0026#34;lzldb\u0026#34;, username=\u0026lt;optimized out\u0026gt;) at postgres.c:4178 #17 0x000000000047e481 in BackendRun (port=0x20eda0) at postmaster.c:4358 #18 BackendStartup (port=0x20eda0) at postmaster.c:4030 #19 ServerLoop () at postmaster.c:1707 #20 0x00000000006c4359 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x21dbe90) at postmaster.c:1380 #21 0x000000000047eefb in main (argc=3, argv=0x21dbe90) at main.c:228 [postgres@hostlzl:lzldb:6666: /pg/pg6666/data/pg_wal]$ pstack 12729 ##checkpointer #0 0x00002b9eec171a0b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0 #1 0x00002b9eec171a9f in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0 #2 0x00002b9eec171b3b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0 #3 0x00000000006b2512 in PGSemaphoreLock (sema=0x2b9ef5fdcd38) at pg_sema.c:316 #4 0x000000000071e94c in LWLockAcquire (lock=lock@entry=0x2babd8cee5b8, mode=mode@entry=LW_EXCLUSIVE) at lwlock.c:1243 #5 0x00000000006ef7cb in SaveSlotToPath (slot=slot@entry=0x2babd8cee500, dir=dir@entry=0x7ffcaffd6ee0 \u0026#34;pg_replslot/pg_lzldb_lzldb_ora_pgdb_pgdb\u0026#34;, elevel=elevel@entry=15) at slot.c:1249 #6 0x00000000006f11a7 in CheckPointReplicationSlots () at slot.c:1100 #7 0x00000000004f674f in CheckPointGuts (checkPointRedo=72994093982360, flags=flags@entry=128) at xlog.c:9146 #8 
0x00000000004fcc77 in CreateCheckPoint (flags=flags@entry=128) at xlog.c:8937 #9 0x00000000006b8312 in CheckpointerMain () at checkpointer.c:491 #10 0x000000000050ba15 in AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7ffcaffd7540) at bootstrap.c:451 #11 0x00000000006c1cb9 in StartChildProcess (type=CheckpointerProcess) at postmaster.c:5337 #12 0x00000000006c2f5a in reaper (postgres_signal_arg=\u0026lt;optimized out\u0026gt;) at postmaster.c:2867 #13 \u0026lt;signal handler called\u0026gt; #14 0x00002b9eed5ba783 in __select_nocancel () from /lib64/libc.so.6 #15 0x000000000047db38 in ServerLoop () at postmaster.c:1671 #16 0x00000000006c4359 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x21dbe90) at postmaster.c:1380 #17 0x000000000047eefb in main (argc=3, argv=0x21dbe90) at main.c:228 The key observation is the LWLockAcquire frame. Both the walsender and the checkpointer are trying to acquire the same LWLOCK address in exclusive mode: lock=lock@entry=0x2babd8cee5b8, mode=mode@entry=LW_EXCLUSIVE — waiting indefinitely.\nThe function right above LWLockAcquire is SaveSlotToPath.\nLooking at the source in src/backend/replication/slot.c, the critical function SaveSlotToPath:\n//SaveSlotToPath stores slot state static void SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel) {\t//11.5 code char\ttmppath[MAXPGPATH]; char\tpath[MAXPGPATH]; int\tfd; ReplicationSlotOnDisk cp; bool\twas_dirty; ... /* and don\u0026#39;t do anything if there\u0026#39;s nothing to write */ if (!was_dirty) return; //Acquire LWLock in exclusive mode at function entry LWLockAcquire(\u0026amp;slot-\u0026gt;io_in_progress_lock, LW_EXCLUSIVE); ... //Note the fd logic — the error matches the second walsender error fd = OpenTransientFile(tmppath, O_CREAT | O_EXCL | O_WRONLY | PG_BINARY); if (fd \u0026lt; 0) { ereport(elevel, (errcode_for_file_access(), errmsg(\u0026#34;could not create file \\\u0026#34;%s\\\u0026#34;: %m\u0026#34;, tmppath))); return; ... 
//The logic for writing to fd; the error matches the first walsender error if ((write(fd, \u0026amp;cp, sizeof(cp))) != sizeof(cp)) { int\tsave_errno = errno; pgstat_report_wait_end(); CloseTransientFile(fd); /* if write didn\u0026#39;t set errno, assume problem is no disk space */ errno = save_errno ? save_errno : ENOSPC; ereport(elevel, (errcode_for_file_access(), errmsg(\u0026#34;could not write to file \\\u0026#34;%s\\\u0026#34;: %m\u0026#34;, tmppath))); return; } ... LWLockRelease(\u0026amp;slot-\u0026gt;io_in_progress_lock);\t//Release LWLock at end of function } SaveSlotToPath calls LWLockAcquire on the slot\u0026rsquo;s io_in_progress_lock in LW_EXCLUSIVE mode. The lock name closely matches the wait event name: io_in_progress_lock ↔ replication_slot_io.\nAt the end of the function, LWLockRelease releases the lock.\nBut in both if branches, there is no LWLockRelease: the function just returns directly!\nThe PostgreSQL log shows both \u0026ldquo;could not write to file\u0026rdquo; and \u0026ldquo;could not create file\u0026rdquo; for tmppath, meaning the code hit both of those if branches: the branch for a failed write to state.tmp and the branch for a failed create of state.tmp.\nLet\u0026rsquo;s reconstruct the timeline from the logs:\n11:36:07: Logical replication first error: \u0026ldquo;could not write to file \u0026hellip; state.tmp\u0026rdquo;. Replication link dies. 11:36:39: Checkpointer error: \u0026ldquo;could not create file \u0026hellip; state.tmp\u0026rdquo;. About 0.2 seconds later (11:36:40), the checkpoint is logged as \u0026ldquo;complete\u0026rdquo;, having written 334 buffers (0.0%) and 0 WAL files. 11:40:07: Logical replication starts again. No more output. 11:48:03: The checkpointer starts a new checkpoint. No more output. Important: the first and second logical replication connections belong to different walsender PIDs; the first and second checkpoint entries belong to the same checkpointer PID.\nFault mechanism reconstructed:\nLogical replication walsender, due to memory pressure, fails to write state.tmp, leaving a residual state.tmp file behind. 
The checkpointer, encountering the residual state.tmp, enters the if (fd \u0026lt; 0) branch in SaveSlotToPath after acquiring the LWLock in exclusive mode — and returns without releasing the LWLock. A new walsender starts for logical replication and tries to acquire the LWLock at the top of SaveSlotToPath — waits indefinitely. The checkpointer triggers a new checkpoint and also tries to acquire the LWLock at the top of SaveSlotToPath — waits indefinitely. With the mechanism clear, the answers follow:\nWhy did the walsender and checkpointer hang? Residual state.tmp. The checkpointer held the LWLock without releasing it. Both walsender and checkpointer wait indefinitely. Who blocks whom? The checkpointer blocks the walsender. How was it triggered? The previous walsender exhausted memory, leaving an uncleaned state.tmp. Solutions? Force restart the database. Reproduction # For background on PostgreSQL logical replication, refer to: PG inner workings: Logical Replication. Key commands:\nselect pg_create_logical_replication_slot(\u0026#39;logical_test\u0026#39;,\u0026#39;test_decoding\u0026#39;); pg_recvlogical -h 127.0.0.1 -p 5558 -d lzldb -U lzl --slot=logical_test --start -f recv.sql \u0026amp; The slot and replication link are ready:\npostgres=# select pid,usename,xact_start,state_change,wait_event,state,query from pg_stat_activity where state\u0026lt;\u0026gt;\u0026#39;idle\u0026#39; order by xact_start ; pid | usename | xact_start | state_change | wait_event | state | query -------+----------+-------------------------------+-------------------------------+---------------------+--------+---------------------------------------------------------------------------------------------- -------------------------------------- 59916 | postgres | 2024-04-08 21:14:32.015534+08 | 2024-04-08 21:14:32.015545+08 | | active | select pid,usename,xact_start,state_change,wait_event,state,query from pg_stat_activity wher e state\u0026lt;\u0026gt;\u0026#39;idle\u0026#39; order by 
xact_start ; 59791 | lzl | | 2024-04-08 21:14:19.566112+08 | WalSenderWaitForWAL | active | SELECT pg_catalog.set_config(\u0026#39;search_path\u0026#39;, \u0026#39;\u0026#39;, false) postgres=# select pid,usename,application_name,backend_start,state,pg_walfile_name_offset(sent_lsn) sentoffset,pg_walfile_name_offset(write_lsn) writeoffset,pg_walfile_name_offset(flush_lsn) flushoffset from pg_stat_replication; pid | usename | application_name | backend_start | state | sentoffset | writeoffset | flushoffset -------+---------+------------------+------------------------------+-----------+------------------------------------+------------------------------------+------------------------------------ 59791 | lzl | pg_recvlogical | 2024-04-08 21:14:19.56364+08 | streaming | (000000010000000000000001,6612032) | (000000010000000000000001,6612032) | (000000010000000000000001,6612032) Since the problem is caused by state.tmp, just touch it under pg_replslot:\n[postgres@testhost logical_test]$ pwd /pgdata/lzl/data11/pg_replslot/logical_test pg_recvlogical immediately errors:\npg_recvlogical: unexpected termination of replication stream: ERROR: could not create file \u0026#34;pg_replslot/logical_test/state.tmp\u0026#34;: File exists Manual CHECKPOINT hangs:\nlzldb=# checkpoint; --hang Now check the walsender and session states:\npostgres=\u0026gt; select * from pg_stat_activity ; datid | datname | pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | xact_start | query_start | state_change | wait_event_type | wait_event | state | backend_xid | backend_xmin | query | backend_type -------+----------+-------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+-------------------------------+------------------------------ 
-+-------------------------------+-----------------+---------------------+--------+-------------+--------------+--------------------------------------------------------+------------------------------ ... | | Activity | LogicalLauncherMain | | | | | logical replication launcher | 2024-04-08 21:25:55.058523+08 | | | active | | | checkpoint; | client backend 16384 | lzldb | 77638 | 16385 | lzl | pg_recvlogical | 127.0.0.1 | | 56928 | 2024-04-08 21:25:17.495833+08 | | 2024-04-08 21:25:17.497754+08 | 2024-04-08 21:25:17.498329+08 | LWLock | replication_slot_io | active | | | SELECT pg_catalog.set_config(\u0026#39;search_path\u0026#39;, \u0026#39;\u0026#39;, false) | walsender | | LWLock | replication_slot_io | | | | | checkpointer Perfectly reproduced — two replication_slot_io wait events.\nPG 12.3 Code Fix # //Here showing 15.3, which has an extra save_errno vs 12.3 static void SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel) {\tfd = OpenTransientFile(tmppath, O_CREAT | O_EXCL | O_WRONLY | PG_BINARY); if (fd \u0026lt; 0) { /* * If not an ERROR, then release the lock before returning. In case * of an ERROR, the error recovery path automatically releases the * lock, but no harm in explicitly releasing even in that case. Note * that LWLockRelease() could affect errno. */ int\tsave_errno = errno; LWLockRelease(\u0026amp;slot-\u0026gt;io_in_progress_lock); errno = save_errno; ereport(elevel, (errcode_for_file_access(), errmsg(\u0026#34;could not create file \\\u0026#34;%s\\\u0026#34;: %m\u0026#34;, tmppath))); return; } ... LWLockRelease(\u0026amp;slot-\u0026gt;io_in_progress_lock); }\tIn every if branch, LWLockRelease is called before returning. This eliminates the logical vulnerability where the LWLock is not released in certain code paths. The code is clearly more robust.\nSolution Analysis # Deleting state.tmp won\u0026rsquo;t help — the LWLock is already held; the file was just the trigger. 
Restarting the replication link or killing the downstream won\u0026rsquo;t help — the checkpointer is the one holding the LWLock. The checkpointer cannot be killed directly. The only solution in this state is a force restart to perform instance recovery. A normal shutdown is impossible because CHECKPOINT is blocked. The ultimate fix: upgrade to PG 12.3 or later. (I also tried using gdb to call LWLockRelease with the LWLock address from pstack — it crashed the test instance immediately. Not recommended.)\nSummary # Logical replication is one of the most significant feature enhancements in recent PostgreSQL releases. Early versions did have many issues and pitfalls. PostgreSQL\u0026rsquo;s ambitious logical replication approach shows genuine innovation, and the community continuously refines and strengthens it — nearly every minor release includes many logical replication updates. This case is a real-world example: the logical replication code is clearly becoming more robust.\nLogical replication has a lot of depth. Recommended reading: PG Inner Workings: Logical Replication\n","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/2024/08/12/case-study-logical-replication-deadlocks-checkpoint-walsender-and-backup/","section":"Posts","summary":"Problem Symptoms # The backup process (pg_start_backup()) was blocked by the checkpointer, and the checkpointer was blocked by the logical replication walsender. 
The database was still serving queries, but backup, checkpoint, and logical replication were all completely hung.\nTwo processes in pg_stat_activity showed an unusual wait event: replication_slot_io.\n[postgres@hostlzl:6666/postgres][04-08.16:50:28]=\u003e select * from pg_stat_activity where pid=173038 \\gx -[ RECORD 1 ]----+------------------------------ datid | 17630 datname | lzldb pid | 173038 usesysid | 35157 usename | repuser application_name | PostgreSQL JDBC Driver client_addr | 30.88.75.58 client_hostname | [null] client_port | 37623 backend_start | 2024-04-02 11:40:07.75022+08 xact_start | [null] query_start | [null] state_change | 2024-04-02 11:40:07.764475+08 wait_event_type | LWLock wait_event | replication_slot_io state | active backend_xid | [null] backend_xmin | [null] query | backend_type | walsender Time: 6.658 ms [postgres@hostlzl:6666/postgres][04-08.16:50:34]=\u003e select * from pg_stat_activity where pid=12729\\gx -[ RECORD 1 ]----+------------------------------ datid | [null] datname | [null] pid | 12729 usesysid | [null] usename | [null] application_name | client_addr | [null] client_hostname | [null] client_port | [null] backend_start | 2024-04-02 11:23:03.343116+08 xact_start | [null] query_start | [null] state_change | [null] wait_event_type | LWLock wait_event | replication_slot_io state | [null] backend_xid | [null] backend_xmin | [null] query | backend_type | checkpointer One walsender and one checkpointer. Both were started on April 2. Let’s check the walsender 173038 logs:\n","title":"Case Study: Logical Replication Deadlocks Checkpoint, Walsender, and Backup","type":"posts"},{"content":" The Phenomenon # Case: The execution plan changed and chose the wrong index, causing SQL performance to degrade from milliseconds to seconds. After collecting statistics, the business SQL was still slow. 
Ultimately, the problem was resolved by dropping the DAILY_DATE time index and creating a composite index on (DAILY_DATE, A_ID).\nQuestions:\nWhy did the optimizer choose the DAILY_DATE index instead of the more selective A_ID index? Why did collecting statistics have no effect? Stale Statistics # -- Simplified SQL select * from tablzl where A_ID = $1 AND IS_DELETE = \u0026#39;N\u0026#39; AND DAILY_DATE = to_date($2, \u0026#39;yyyyMMdd\u0026#39;) and PARTITION_KEY \u0026gt;= $3 and PARTITION_KEY \u0026lt;= $4 The optimizer chose the DAILY_DATE index instead of the more selective A_ID index:\nAppend (cost=0.44..8.83 rows=2 width=204) -\u0026gt; Index Scan using tablzl_p202401_DAILY_DATE_idx on tablzl_p202401 tablzl_1 (cost=0.44..5.47 rows=1 width=203) Index Cond: (DAILY_DATE = to_date(\u0026#39;20240223\u0026#39;::text, \u0026#39;yyyyMMdd\u0026#39;::text)) Filter: ((partition_key \u0026gt;= 202401) AND (partition_key \u0026lt;= 202402) AND ((A_ID)::text = \u0026#39;ID1234567890987654321\u0026#39;::text) AND ((is_delete)::text = \u0026#39;N\u0026#39;::text)) -\u0026gt; Index Scan using tablzl_p202402_DAILY_DATE_idx on tablzl_p202402 tablzl_2 (cost=0.44..3.35 rows=1 width=204) Index Cond: (DAILY_DATE = to_date(\u0026#39;20240223\u0026#39;::text, \u0026#39;yyyyMMdd\u0026#39;::text)) Filter: ((partition_key \u0026gt;= 202401) AND (partition_key \u0026lt;= 202402) AND ((A_ID)::text = \u0026#39;ID1234567890987654321\u0026#39;::text) AND ((is_delete)::text = \u0026#39;N\u0026#39;::text)) For the p202401 partition, whether it uses the DAILY_DATE or A_ID index doesn\u0026rsquo;t make much difference, because the January partition has no data for February 23. For the p202402 partition, whether it uses the DAILY_DATE or A_ID index makes a huge difference. Using the DAILY_DATE index, its estimated cost is 3.35 with rows=1, but in reality there are millions of rows, causing it to run for 2 seconds. 
The statistics for p202402 contain MCV (Most Common Values):\n= select * from pg_stats where tablename=\u0026#39;tablzl_p202402\u0026#39; and attname=\u0026#39;DAILY_DATE\u0026#39; \\gx most_common_vals | {2024-02-21,2024-02-20,2024-02-22,2024-02-10,2024-02-15,2024-02-19,2024-02-16,2024-02-18,2024-02-17,2024-02-14,2024-02-11,2024-02-07,2024-02-12,2024-02-06,2024-02-08,2024-02-09,2024-02-03,2024-02-05,2024-02-01,2024-02-02,2024-01-31,2024-02-13,2024-02-04} most_common_freqs | {0.0481,0.047766667,0.0466,0.0449,0.0441,0.043833334,0.043733332,0.043466665,0.043133333,0.043066666,0.042366665,0.041866668,0.041366667,0.041366667,0.039766666,0.0394,0.039333332,0.038766667,0.03863333,0.0381,0.038066667,0.037966665,0.037566666,0.036733333} Calculate the sum of MCV frequencies:\n= select 0.0481+0.047766667+0.0466+0.0449+0.0441+0.043833334+0.043733332+0.043466665+0.043133333+0.043066666+0.042366665+0.041866668+0.041366667+0.041366667+0.039766666+0.0394+0.039333332+0.038766667+0.03863333+0.0381+0.038066667+0.037966665+0.037566666+0.036733333; ?column? ------------- 0.999999990 The sum is effectively 1, meaning the planner believes the dates in the MCV list (up to February 22) account for all the data in this partition, and day 23 should have 0 rows. So when estimating rows for day 23, the planner clamps the estimate to rows=1 and therefore chooses the DAILY_DATE index. In reality, day 23 had millions of rows.\nEssentially, this is a problem caused by stale statistics. Why were the first 22 days fine, and why didn\u0026rsquo;t day 23 trigger automatic collection?\n= select relname,reloptions from pg_class where relname=\u0026#39;tablzl\u0026#39;; relname | reloptions ----------------------------+------------ tablzl | [null] = show autovacuum_analyze_scale_factor; autovacuum_analyze_scale_factor --------------------------------- 0.1 The table has no per-table override, so the default autovacuum_analyze_scale_factor of 0.1 applies: auto-ANALYZE only triggers once the changed rows reach 1/10 of the table. This is a monthly partition, with data inserted and updated daily. 
Early in the month, writing 2 million rows per day would trigger multiple ANALYZEs (the threshold of 50 can be ignored), but at month end, for example on day 23, writing 2 million rows would not trigger ANALYZE because only 1/23 of the data changed. In this scenario, data was also updated after insertion — 2 million inserts and 2 million updates — so the data change on day 23 was about 1/11, just barely not triggering ANALYZE. This also explains why the first 20 days ran stably.\nAdditionally, since the data change threshold is a ratio, as long as the daily data change volume is relatively uniform, this month-end statistics inaccuracy problem will always occur!\nExecution Plan Caching # Since this was a stale statistics problem, manually collecting statistics should have resolved it. In practice, however, after collection, the business SQL was still slow.\nAfter running ANALYZE, manual EXPLAIN ANALYZE showed the correct execution plan.\nThis indicated that ANALYZE should have helped, but it didn\u0026rsquo;t affect the business sessions. Since the SQL execution used long-lived sessions, I suspected that the JDBC driver was using prepared statements to cache execution plans (JDBC PreparedStatement).\nIn PostgreSQL 13 (RasesQL 1.3), collecting statistics does not invalidate prepared statements; re-parsing only happens by reconnecting the session.\nPrepared statements generate a generic execution plan. 
Due to inaccurate statistics, the generic execution plan, like the parameter-specific execution plan, could choose the wrong index.\nCharacteristics of Prepared Statements # PostgreSQL supports prepared statements, and plan selection for them is controlled by the plan_cache_mode parameter:\nauto: default, uses the five-execution mechanism force_custom_plan: always performs hard parsing, generating a custom plan force_generic_plan: always uses the generic plan with bound variables Syntax:\nPREPARE plan1(text,integer) AS select * from tlzl1 where id=$1 and month=$2; EXECUTE plan1(\u0026#39;256ac66bb53d31bc6124294238d6410c\u0026#39;,\u0026#39;11\u0026#39;); deallocate plan1|all; -- invalidates the prepared statement; disconnecting also works View: (of limited use because it is session-local; you can\u0026rsquo;t see other sessions\u0026rsquo; statements in production)\nselect * from pg_prepared_statements; How Generic Plans Are Generated # Normally, a prepared statement can generate a generic plan after running 5 times. There are many demonstrations online, so I won\u0026rsquo;t demonstrate the normal case here. Below are the \u0026ldquo;magical\u0026rdquo; phenomena I observed during testing:\n-- Prepare data create table tlzl1(id varchar(50),month int); INSERT INTO tlzl1 SELECT md5(g::text),EXTRACT(month FROM g) FROM generate_series(\u0026#39;2023-01-01\u0026#39;::date, \u0026#39;2023-11-30\u0026#39;::date, \u0026#39;1 minute\u0026#39;) as g; create index idx_id on tlzl1(id); create index idx_month on tlzl1(month); analyze tlzl1; -- Execute prepared statement PREPARE plan1(text,integer) AS select * from tlzl1 where id=$1 and month=$2; EXECUTE plan1(\u0026#39;256ac66bb53d31bc6124294238d6410c\u0026#39;,\u0026#39;11\u0026#39;); explain analyze EXECUTE plan1(\u0026#39;256ac66bb53d31bc6124294238d6410c\u0026#39;,\u0026#39;11\u0026#39;); Note that only data through November was inserted; December has no data. 
At this point, querying December data can use the month index:\n=# explain analyze EXECUTE plan1(\u0026#39;256ac66bb53d31bc6124294238d6410c\u0026#39;,\u0026#39;12\u0026#39;); QUERY PLAN ------------------------------------------------------------------------------------------------------------------ Index Scan using idx_month on tlzl1 (cost=0.42..2.94 rows=1 width=37) (actual time=0.035..0.036 rows=0 loops=1) Index Cond: (month = 12) Filter: ((id)::text = \u0026#39;256ac66bb53d31bc6124294238d6410c\u0026#39;::text) Planning Time: 0.170 ms Execution Time: 0.058 ms (5 rows) Time: 0.551 ms =# explain analyze EXECUTE plan1(\u0026#39;256ac66bb53d31bc6124294238d6410c\u0026#39;,\u0026#39;12\u0026#39;); QUERY PLAN ------------------------------------------------------------------------------------------------------------------ Index Scan using idx_month on tlzl1 (cost=0.42..2.94 rows=1 width=37) (actual time=0.021..0.021 rows=0 loops=1) Index Cond: (month = 12) Filter: ((id)::text = \u0026#39;256ac66bb53d31bc6124294238d6410c\u0026#39;::text) Planning Time: 0.168 ms Execution Time: 0.046 ms (5 rows) Time: 0.488 ms =# explain analyze EXECUTE plan1(\u0026#39;256ac66bb53d31bc6124294238d6410c\u0026#39;,\u0026#39;12\u0026#39;); QUERY PLAN ------------------------------------------------------------------------------------------------------------------ Index Scan using idx_month on tlzl1 (cost=0.42..2.94 rows=1 width=37) (actual time=0.017..0.018 rows=0 loops=1) Index Cond: (month = 12) Filter: ((id)::text = \u0026#39;256ac66bb53d31bc6124294238d6410c\u0026#39;::text) Planning Time: 0.157 ms Execution Time: 0.040 ms (5 rows) Time: 0.419 ms =# explain analyze EXECUTE plan1(\u0026#39;256ac66bb53d31bc6124294238d6410c\u0026#39;,\u0026#39;12\u0026#39;); QUERY PLAN ------------------------------------------------------------------------------------------------------------------ Index Scan using idx_month on tlzl1 (cost=0.42..2.94 rows=1 width=37) (actual time=0.019..0.020 rows=0 loops=1) 
Index Cond: (month = 12) Filter: ((id)::text = \u0026#39;256ac66bb53d31bc6124294238d6410c\u0026#39;::text) Planning Time: 0.160 ms Execution Time: 0.044 ms (5 rows) Time: 0.479 ms =# explain analyze EXECUTE plan1(\u0026#39;256ac66bb53d31bc6124294238d6410c\u0026#39;,\u0026#39;12\u0026#39;); QUERY PLAN ------------------------------------------------------------------------------------------------------------------ Index Scan using idx_month on tlzl1 (cost=0.42..2.94 rows=1 width=37) (actual time=0.018..0.018 rows=0 loops=1) Index Cond: (month = 12) Filter: ((id)::text = \u0026#39;256ac66bb53d31bc6124294238d6410c\u0026#39;::text) Planning Time: 0.155 ms Execution Time: 0.041 ms (5 rows) Time: 0.426 ms -- Sixth execution =# explain analyze EXECUTE plan1(\u0026#39;256ac66bb53d31bc6124294238d6410c\u0026#39;,\u0026#39;12\u0026#39;); QUERY PLAN --------------------------------------------------------------------------------------------------------------- Index Scan using idx_id on tlzl1 (cost=0.42..5.44 rows=1 width=37) (actual time=0.044..0.045 rows=0 loops=1) Index Cond: ((id)::text = $1) Filter: (month = $2) Rows Removed by Filter: 1 Planning Time: 0.023 ms Execution Time: 0.079 ms (6 rows) On the sixth execution, the generic plan was bound — but it wasn\u0026rsquo;t the same plan as the first five executions; it used the id index. If id had even higher cardinality, you could also observe cases where the generic plan simply couldn\u0026rsquo;t be bound (not shown here).\nLet\u0026rsquo;s look at the source code:\nchoose_custom_plan:\nstatic bool choose_custom_plan(CachedPlanSource *plansource, ParamListInfo boundParams) { ... /* Generate custom plans until we have done at least 5 (arbitrary) */ if (plansource-\u0026gt;num_custom_plans \u0026lt; 5) return true; avg_custom_cost = plansource-\u0026gt;total_custom_cost / plansource-\u0026gt;num_custom_plans; /* * Prefer generic plan if it\u0026#39;s less expensive than the average custom * plan. 
(Because we include a charge for cost of planning in the * custom-plan costs, this means the generic plan only has to be less * expensive than the execution cost plus replan cost of the custom * plans.) * * Note that if generic_cost is -1 (indicating we\u0026#39;ve not yet determined * the generic plan cost), we\u0026#39;ll always prefer generic at this point. */ if (plansource-\u0026gt;generic_cost \u0026lt; avg_custom_cost) return false; return true; }\tAs long as the generic plan\u0026rsquo;s cost is less than the average cost of the custom plans built so far (the first 5), the generic plan is used.\nWhile the 5-execution mechanism is well-known, it\u0026rsquo;s important to note how the generic plan is generated. On the sixth execution, once 5 custom plans have been built, there is no generic plan yet (initially, generic_cost=-1), so choose_custom_plan returns false and execution goes directly to the !customplan logic in GetCachedPlan:\nCachedPlan * GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams, bool useResOwner, QueryEnvironment *queryEnv) { ... customplan = choose_custom_plan(plansource, boundParams); if (!customplan) { if (CheckCachedPlan(plansource)) { /* We want a generic plan, and we already have a valid one */ plan = plansource-\u0026gt;gplan; Assert(plan-\u0026gt;magic == CACHEDPLAN_MAGIC); } else { /* Build a new generic plan */ plan = BuildCachedPlan(plansource, qlist, NULL, queryEnv); /* Just make real sure plansource-\u0026gt;gplan is clear */ ReleaseGenericPlan(plansource); /* Link the new generic plan into the plansource */ plansource-\u0026gt;gplan = plan; plan-\u0026gt;refcount++; /* Immediately reparent into appropriate context */ if (plansource-\u0026gt;is_saved) { /* saved plans all live under CacheMemoryContext */ MemoryContextSetParent(plan-\u0026gt;context, CacheMemoryContext); plan-\u0026gt;is_saved = true; } else { /* otherwise, it should be a sibling of the plansource */ MemoryContextSetParent(plan-\u0026gt;context, MemoryContextGetParent(plansource-\u0026gt;context)); } /* Update generic_cost whenever we make a 
new generic plan */ plansource-\u0026gt;generic_cost = cached_plan_cost(plan, false); /* * If, based on the now-known value of generic_cost, we\u0026#39;d not have * chosen to use a generic plan, then forget it and make a custom * plan. This is a bit of a wart but is necessary to avoid a * glitch in behavior when the custom plans are consistently big * winners; at some point we\u0026#39;ll experiment with a generic plan and * find it\u0026#39;s a loser, but we don\u0026#39;t want to actually execute that * plan. */ customplan = choose_custom_plan(plansource, boundParams); /* * If we choose to plan again, we need to re-copy the query_list, * since the planner probably scribbled on it. We can force * BuildCachedPlan to do that by passing NIL. */ qlist = NIL; } } ... return plan; }\tIn the !customplan logic, if a generic plan already exists, use it directly. If not, generate one via BuildCachedPlan, which is the main logic for generating plans — converting a query tree into a plan tree.\nWhat about parameters? As the comments explain, pass NULL when there are no parameters to enter the plan generation logic:\nTo build a generic, parameter-value-independent plan, pass NULL for * boundParams. To build a custom plan, pass the actual parameter values via * boundParams What execution plan does the optimizer prefer when NULL is passed? This part of the code logic is somewhat complex. 
From the optimizer\u0026rsquo;s perspective, there may be multiple plans to choose from, but one must be selected as the generic plan.\nAnd that selected generic plan is what gets compared against the cost of the first 5 plans.\nWhy didn\u0026rsquo;t repeatedly executing a lower-cost plan produce the desired generic plan?\nWhat the generic plan looks like has nothing to do with the first five execution plans — the first five only determine whether this generic plan gets bound.\nFrom an optimizer design perspective, generic plans are meant to reduce parsing time and improve SQL execution efficiency, suitable for many small queries. The problem is that generic plans themselves are crude, and PostgreSQL introduced the five-execution mechanism precisely to reduce the likelihood of a generic plan being terrible.\nEven with the five-execution mechanism, the reasons a bad generic plan still gets bound are:\nGeneric plans are plans too, and they can inherently be bad Statistics are inaccurate, so the generic plan\u0026rsquo;s cost estimate is very low The first five executions had low selectivity (or other factors) causing high custom plan costs Prepared Statement Invalidation # Besides DDL, DEALLOCATE, and disconnecting sessions, collecting statistics can also invalidate prepared statements — but this is a PostgreSQL 14 feature.\nPostgreSQL 13:\nPostgreSQL will force re-analysis and re-planning of the statement before using it whenever database objects used in the statement have undergone definitional (DDL) changes since the previous use of the prepared statement\nPostgreSQL 14:\nPostgreSQL will force re-analysis and re-planning of the statement before using it whenever database objects used in the statement have undergone definitional (DDL) changes or their planner statistics have been updated since the previous use of the prepared statement\nTest confirming that in PostgreSQL 13, collecting statistics does not invalidate prepared statements:\n= explain analyze EXECUTE 
plan1(\u0026#39;256ac66bb53d31bc6124294238d6410c\u0026#39;,\u0026#39;11\u0026#39;); QUERY PLAN --------------------------------------------------------------------------------------------------------------- Index Scan using idx_id on tlzl1 (cost=0.42..5.44 rows=1 width=37) (actual time=0.033..0.033 rows=0 loops=1) Index Cond: ((id)::text = $1) Filter: (month = $2) Rows Removed by Filter: 1 Planning Time: 0.098 ms Execution Time: 0.050 ms (6 rows) = select * from pg_prepared_statements; name | statement | prepare_time | parameter_types | from_sql -------+-----------------------------------------------+-------------------------------+-----------------+---------- plan1 | PREPARE plan1(text,integer) AS +| 2024-02-29 14:27:59.966733+08 | {text,integer} | t | select * from tlzl1 where id=$1 and month=$2; | | | (1 row) = analyze tlzl1; ANALYZE = select * from pg_prepared_statements; name | statement | prepare_time | parameter_types | from_sql -------+-----------------------------------------------+-------------------------------+-----------------+---------- plan1 | PREPARE plan1(text,integer) AS +| 2024-02-29 14:27:59.966733+08 | {text,integer} | t | select * from tlzl1 where id=$1 and month=$2; | | | (1 row) = explain analyze EXECUTE plan1(\u0026#39;256ac66bb53d31bc6124294238d6410c\u0026#39;,\u0026#39;11\u0026#39;); QUERY PLAN --------------------------------------------------------------------------------------------------------------- Index Scan using idx_id on tlzl1 (cost=0.42..5.44 rows=1 width=37) (actual time=0.051..0.052 rows=0 loops=1) Index Cond: ((id)::text = $1) Filter: (month = $2) Rows Removed by Filter: 1 Planning Time: 0.022 ms Execution Time: 0.098 ms (6 rows) JDBC Prepared Statements # Prepared statements are not unique to PostgreSQL — other databases also have similar pre-parsing features. 
For example, Oracle can achieve similar functionality.\nJDBC itself can call the database\u0026rsquo;s pre-parsing interface and directly use prepared statements.\nExample JDBC usage:\nString sql = \u0026#34;select * from people where id=?\u0026#34;; PreparedStatement preparedStatement = connection.prepareStatement(sql); Recommendations # Reduce the table-level autovacuum_analyze_scale_factor to 0.02 (why 0.02? Because 0.02 \u0026lt; 1/31). Since data is written and queried simultaneously, manual collection timing is hard to get right; reducing autovacuum_analyze_scale_factor can only mitigate this problem. Consider removing the PREPARE setting in JDBC, or set plan_cache_mode to force_custom_plan. Adjust the SQL logic. Adjust indexes: 4.1 Remove unnecessary time indexes; 4.2 Rebuild the index that gets chosen after predicate out-of-bounds as a composite index that includes the id field (a good suggestion). Emergency procedure: If business performance doesn\u0026rsquo;t recover after statistics collection, and you\u0026rsquo;ve confirmed the execution plan has changed via manual EXPLAIN, consider killing sessions (for PostgreSQL 13 and earlier). Finally, predicate out-of-bounds problems exist in essentially all databases, especially on time-based fields. There is currently no simple yet perfectly effective solution. Oracle\u0026rsquo;s SPM (SQL Plan Management) earns another point in my book\u0026hellip;\n","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/2024/08/12/case-study-predicate-out-of-bounds-and-prepared-statement-issues-in-postgresql/","section":"Posts","summary":"The Phenomenon # Case: The execution plan changed and chose the wrong index, causing SQL performance to degrade from milliseconds to seconds. After collecting statistics, the business SQL was still slow. 
Ultimately, the problem was resolved by dropping the DAILY_DATE time index and creating a composite index on (DAILY_DATE, A_ID).\nQuestions:\nWhy did the optimizer choose the DAILY_DATE index instead of the more selective A_ID index? Why did collecting statistics have no effect? Stale Statistics # -- Simplified SQL select * from tablzl where A_ID = $1 AND IS_DELETE = 'N' AND DAILY_DATE = to_date($2, 'yyyyMMdd') and PARTITION_KEY \u003e= $3 and PARTITION_KEY \u003c= $4 The optimizer chose the DAILY_DATE index instead of the more selective A_ID index:\n","title":"Case Study: Predicate Out-of-Bounds and Prepared Statement Issues in PostgreSQL","type":"posts"},{"content":" How Did a Primary Key Query Access Multiple Data Pages? # Continuing from the previous article: A Classic Case of Long Transactions, Table Bloat, and LIMIT Problems, there was one point not explained in detail:\nWhy does a query using the primary key generate so many shared hits? Why does index bloat cause access to multiple data pages? Can\u0026rsquo;t data outside the page be located through the corresponding index entry? This relates to index version management — indexes do carry some version information, but not much. Let\u0026rsquo;s first review PostgreSQL\u0026rsquo;s btree index structure.\n（https://en.wikibooks.org/wiki/PostgreSQL/Index_Btree）\nThis PG btree wiki diagram doesn\u0026rsquo;t explain how dead tuples and dead index entries are accessed — it lacks version information. 
For now, you don\u0026rsquo;t need to understand every detail of this structure; just know that a btree structure like this exists.\nTo investigate the btree version access problem, let\u0026rsquo;s run a test:\ncreate table tab1(a bigserial,b char(1000)); create index idx_tab1_a on tab1(a); alter table tab1 set (autovacuum_enabled = off); --disable autovacuum alter table tab1 alter column b set storage PLAIN; --disable toast lzldb=\u0026gt; insert into tab1(b) values(\u0026#39;zzzzzzzzz\u0026#39;); INSERT 0 1 --View tuple info on the data page lzldb=\u0026gt; select t_ctid,lp,case lp_flags when 0 then \u0026#39;LP_UNUSED\u0026#39; when 1 then \u0026#39;LP_NORMAL\u0026#39; when 2 then \u0026#39;LP_REDIRECT\u0026#39; when 3 then \u0026#39;LP_DEAD\u0026#39; end as lp_flags,t_xmin,t_xmax,t_field3 as t_cid, raw_flags, info.combined_flags from heap_page_items(get_raw_page(\u0026#39;tab1\u0026#39;,0)) item,LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2) info order by lp; t_ctid | lp | lp_flags | t_xmin | t_xmax | t_cid | raw_flags | combined_flags --------+----+-----------+--------+--------+-------+--------------------------------------+---------------- (0,1) | 1 | LP_NORMAL | 111875 | 0 | 0 | {HEAP_HASVARWIDTH,HEAP_XMAX_INVALID} | {} --View index entry info on the index page (note: index page 0 is the meta page, has no data) lzldb=\u0026gt; SELECT itemoffset, ctid, itemlen, nulls, vars, data, dead, htid, tids[0:2] AS some_tids FROM bt_page_items(\u0026#39;idx_tab1_a\u0026#39;,1); itemoffset | ctid | itemlen | nulls | vars | data | dead | htid | some_tids ------------+-------+---------+-------+------+-------------------------+------+-------+----------- 1 | (0,1) | 16 | f | f | 01 00 00 00 00 00 00 00 | f | (0,1) | Only one row inserted: data page 0 has only 1 tuple, index page 1 has only one entry pointing to ctid(0,1).\nlzldb=\u0026gt; update tab1 set b=\u0026#39;xxxxxxx\u0026#39; ; UPDATE 1 lzldb=\u0026gt; select t_ctid,lp,case lp_flags when 0 then 
\u0026#39;LP_UNUSED\u0026#39; when 1 then \u0026#39;LP_NORMAL\u0026#39; when 2 then \u0026#39;LP_REDIRECT\u0026#39; when 3 then \u0026#39;LP_DEAD\u0026#39; end as lp_flags,t_xmin,t_xmax,t_field3 as t_cid, raw_flags, info.combined_flags from heap_page_items(get_raw_page(\u0026#39;tab1\u0026#39;,0)) item,LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2) info order by lp; t_ctid | lp | lp_flags | t_xmin | t_xmax | t_cid | raw_flags | combined_flags --------+----+-----------+--------+--------+-------+-------------------------------------------------------------------+---------------- (0,2) | 1 | LP_NORMAL | 111875 | 111876 | 0 | {HEAP_HASVARWIDTH,HEAP_XMIN_COMMITTED,HEAP_HOT_UPDATED} | {} (0,2) | 2 | LP_NORMAL | 111876 | 0 | 0 | {HEAP_HASVARWIDTH,HEAP_XMAX_INVALID,HEAP_UPDATED,HEAP_ONLY_TUPLE} | {} (2 rows) lzldb=\u0026gt; SELECT itemoffset, ctid, itemlen, nulls, vars, data, dead, htid, tids[0:2] AS some_tids FROM bt_page_items(\u0026#39;idx_tab1_a\u0026#39;,1); itemoffset | ctid | itemlen | nulls | vars | data | dead | htid | some_tids ------------+-------+---------+-------+------+-------------------------+------+-------+----------- 1 | (0,1) | 16 | f | f | 01 00 00 00 00 00 00 00 | f | (0,1) | After updating one row: data page 0 has 2 tuples. Only ctid(0,2) is alive. The tuple at lp=1 is \u0026ldquo;dead\u0026rdquo; but lp_flags is still \u0026ldquo;NORMAL\u0026rdquo;! Index page 1 still has only one entry pointing to ctid(0,1), which is the \u0026ldquo;dead\u0026rdquo; tuple. This is the principle of HOT (Heap-Only Tuple): when updating within the same page, the index entry is not updated. 
The index follows the ctid chain from the dead tuple to find the truly alive data tuple.\nLet\u0026rsquo;s update 10 times in a loop, producing 2 data pages and 1 index page:\nDO $$ begin FOR i IN 1..10 LOOP update tab1 set b=md5(i::text); END LOOP; end $$;; After updates:\n--First data page lzldb=\u0026gt; select t_ctid,lp,case lp_flags when 0 then \u0026#39;LP_UNUSED\u0026#39; when 1 then \u0026#39;LP_NORMAL\u0026#39; when 2 then \u0026#39;LP_REDIRECT\u0026#39; when 3 then \u0026#39;LP_DEAD\u0026#39; end as lp_flags,t_xmin,t_xmax,t_field3 as t_cid, raw_flags, info.combined_flags from heap_page_items(get_raw_page(\u0026#39;tab1\u0026#39;,0)) item,LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2) info order by lp; t_ctid | lp | lp_flags | t_xmin | t_xmax | t_cid | raw_flags | combined_flag s --------+----+-------------+--------+--------+-------+--------------------------------------------------------------------------------------+-------------- -- | 1 | LP_REDIRECT | | | | | (0,3) | 2 | LP_NORMAL | 111876 | 111877 | 0 | {HEAP_HASVARWIDTH,HEAP_XMIN_COMMITTED,HEAP_UPDATED,HEAP_HOT_UPDATED,HEAP_ONLY_TUPLE} | {} (0,4) | 3 | LP_NORMAL | 111877 | 111877 | 0 | {HEAP_HASVARWIDTH,HEAP_COMBOCID,HEAP_UPDATED,HEAP_HOT_UPDATED,HEAP_ONLY_TUPLE} | {} (0,5) | 4 | LP_NORMAL | 111877 | 111877 | 1 | {HEAP_HASVARWIDTH,HEAP_COMBOCID,HEAP_UPDATED,HEAP_HOT_UPDATED,HEAP_ONLY_TUPLE} | {} (0,6) | 5 | LP_NORMAL | 111877 | 111877 | 2 | {HEAP_HASVARWIDTH,HEAP_COMBOCID,HEAP_UPDATED,HEAP_HOT_UPDATED,HEAP_ONLY_TUPLE} | {} (0,7) | 6 | LP_NORMAL | 111877 | 111877 | 3 | {HEAP_HASVARWIDTH,HEAP_COMBOCID,HEAP_UPDATED,HEAP_HOT_UPDATED,HEAP_ONLY_TUPLE} | {} (1,1) | 7 | LP_NORMAL | 111877 | 111877 | 4 | {HEAP_HASVARWIDTH,HEAP_COMBOCID,HEAP_UPDATED,HEAP_ONLY_TUPLE} | {} (7 rows) --Second data page lzldb=\u0026gt; select t_ctid,lp,case lp_flags when 0 then \u0026#39;LP_UNUSED\u0026#39; when 1 then \u0026#39;LP_NORMAL\u0026#39; when 2 then \u0026#39;LP_REDIRECT\u0026#39; when 3 then 
\u0026#39;LP_DEAD\u0026#39; end as lp_flags,t_xmin,t_xmax,t_field3 as t_cid, raw_flags, info.combined_flags from heap_page_items(get_raw_page(\u0026#39;tab1\u0026#39;,1)) item,LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2) info order by lp; t_ctid | lp | lp_flags | t_xmin | t_xmax | t_cid | raw_flags | combined_flags --------+----+-----------+--------+--------+-------+--------------------------------------------------------------------------------+---------------- (1,2) | 1 | LP_NORMAL | 111877 | 111877 | 5 | {HEAP_HASVARWIDTH,HEAP_COMBOCID,HEAP_UPDATED,HEAP_HOT_UPDATED} | {} (1,3) | 2 | LP_NORMAL | 111877 | 111877 | 6 | {HEAP_HASVARWIDTH,HEAP_COMBOCID,HEAP_UPDATED,HEAP_HOT_UPDATED,HEAP_ONLY_TUPLE} | {} (1,4) | 3 | LP_NORMAL | 111877 | 111877 | 7 | {HEAP_HASVARWIDTH,HEAP_COMBOCID,HEAP_UPDATED,HEAP_HOT_UPDATED,HEAP_ONLY_TUPLE} | {} (1,5) | 4 | LP_NORMAL | 111877 | 111877 | 8 | {HEAP_HASVARWIDTH,HEAP_COMBOCID,HEAP_UPDATED,HEAP_HOT_UPDATED,HEAP_ONLY_TUPLE} | {} (1,5) | 5 | LP_NORMAL | 111877 | 0 | 9 | {HEAP_HASVARWIDTH,HEAP_XMAX_INVALID,HEAP_UPDATED,HEAP_ONLY_TUPLE} | {} On the first data page (page 0), the LP_REDIRECT status directly tells us the page definitely has HOT chains. At lp=1 this output shows no other information: no ctid, data, or infomask. You cannot trace through this lp to find the final data. For the first index entry, it\u0026rsquo;s sufficient to access ctid(0,1); there is no desired data row in this page. But the second data page (page 1) has no LP_REDIRECT, and the index can find the live tuple (1,5) within the page by following the ctid chain from ctid(1,1).\nSource code explanation of line pointer states:\n/* *lp_flags has these possible states. An UNUSED line pointer is available *for immediate re-use, the other states are not. 
*/ #define LP_UNUSED\t0\t/* unused (should always have lp_len=0) */ #define LP_NORMAL\t1\t/* used (should always have lp_len\u0026gt;0) */ #define LP_REDIRECT\t2\t/* HOT redirect (should have lp_len=0), actually not HOT but cross-page redirect indicator */ #define LP_DEAD\t3\t/* dead, may or may not have storage */ //Explanation of LP_REDIRECT Redirecting line pointer A line pointer that points to another line pointer and has no associated tuple. It has the special lp_flags state LP_REDIRECT, and lp_off is the OffsetNumber of the line pointer it links to. This is used when a root tuple becomes dead but we cannot prune the line pointer because there are non-dead heap-only tuples further down the chain. Looking back more carefully, the lp status of what we consider \u0026ldquo;dead\u0026rdquo; tuples is LP_NORMAL, not LP_DEAD. This is important because we\u0026rsquo;ll revisit this point later.\nContinuing to examine the index page:\nlzldb=\u0026gt; SELECT itemoffset, ctid, itemlen, nulls, vars, data, dead, htid, tids[0:2] AS some_tids FROM bt_page_items(\u0026#39;idx_tab1_a\u0026#39;,1); itemoffset | ctid | itemlen | nulls | vars | data | dead | htid | some_tids ------------+-------+---------+-------+------+-------------------------+------+-------+----------- 1 | (0,1) | 16 | f | f | 01 00 00 00 00 00 00 00 | f | (0,1) | 2 | (1,1) | 16 | f | f | 01 00 00 00 00 00 00 00 | f | (1,1) | Because an additional page was created, HOT no longer applies. The index is updated. The index page has only 2 entries, both alive (dead=f), each pointing to the first tuple of its respective page: (0,1) and (1,1). For cross-page updates, the index page is also updated, with each index entry pointing to its own page. Note: at this point the table has only 1 row of data, but the index has 2 entries, both alive. 
This is why a primary key scan accesses multiple data pages.\nLet\u0026rsquo;s update more data to produce multiple index pages:\nDO $$ begin FOR i IN 1..10000 LOOP update tab1 set b=md5(i::text); END LOOP; end $$; --First index page lzldb=\u0026gt; SELECT itemoffset, ctid, itemlen, nulls, vars, data, dead, htid, tids[0:2] AS some_tids FROM bt_page_items(\u0026#39;idx_tab1_a\u0026#39;,1); itemoffset | ctid | itemlen | nulls | vars | data | dead | htid | some_tids ------------+-------------+---------+-------+------+-------------------------+------+----------+------------------------- 1 | (1278,4097) | 24 | f | f | 01 00 00 00 00 00 00 00 | | (1277,1) | 2 | (16,8414) | 1352 | f | f | 01 00 00 00 00 00 00 00 | f | (0,1) | {\u0026#34;(0,1)\u0026#34;,\u0026#34;(1,1)\u0026#34;} 3 | (16,8414) | 1352 | f | f | 01 00 00 00 00 00 00 00 | f | (222,1) | {\u0026#34;(222,1)\u0026#34;,\u0026#34;(223,1)\u0026#34;} 4 | (16,8414) | 1352 | f | f | 01 00 00 00 00 00 00 00 | f | (444,1) | {\u0026#34;(444,1)\u0026#34;,\u0026#34;(445,1)\u0026#34;} 5 | (16,8414) | 1352 | f | f | 01 00 00 00 00 00 00 00 | f | (666,1) | {\u0026#34;(666,1)\u0026#34;,\u0026#34;(667,1)\u0026#34;} 6 | (16,8414) | 1352 | f | f | 01 00 00 00 00 00 00 00 | f | (888,1) | {\u0026#34;(888,1)\u0026#34;,\u0026#34;(889,1)\u0026#34;} --Second index page lzldb=\u0026gt; SELECT itemoffset, ctid, itemlen, nulls, vars, data, dead, htid, tids[0:2] AS some_tids FROM bt_page_items(\u0026#39;idx_tab1_a\u0026#39;,2); itemoffset | ctid | itemlen | nulls | vars | data | dead | htid | some_tids ------------+----------+---------+-------+------+-------------------------+------+----------+----------- 1 | (1278,1) | 16 | f | f | 01 00 00 00 00 00 00 00 | f | (1278,1) | 2 | (1279,1) | 16 | f | f | 01 00 00 00 00 00 00 00 | f | (1279,1) | 3 | (1280,1) | 16 | f | f | 01 00 00 00 00 00 00 00 | f | (1280,1) | 4 | (1281,1) | 16 | f | f | 01 00 00 00 00 00 00 00 | f | (1281,1) | ... 
152 | (1429,1) | 16 | f | f | 01 00 00 00 00 00 00 00 | f | (1429,1) | 153 | (1430,1) | 16 | f | f | 01 00 00 00 00 00 00 00 | f | (1430,1) | (153 rows) --Third index page lzldb=\u0026gt; SELECT itemoffset, ctid, itemlen, nulls, vars, data, dead, htid, tids[0:2] AS some_tids FROM bt_page_items(\u0026#39;idx_tab1_a\u0026#39;,3); itemoffset | ctid | itemlen | nulls | vars | data | dead | htid | some_tids ------------+----------+---------+-------+------+-------------------------+------+----------+----------- 1 | (1,0) | 8 | f | f | | | | 2 | (2,4097) | 24 | f | f | 01 00 00 00 00 00 00 00 | | (1277,1) | There are 3 index pages total. Page 1 is the root node. Pages 2 and 3 are leaf nodes. The dead status of all their index entries is \u0026ldquo;f\u0026rdquo;.\nNow let\u0026rsquo;s return to the SQL, using the primary key index:\nlzldb=\u0026gt; explain (analyze,buffers) select * from tab1 where a=1; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on tab1 (cost=4.39..56.41 rows=14 width=4012) (actual time=2.594..2.596 rows=1 loops=1) Recheck Cond: (a = 1) Heap Blocks: exact=1 Buffers: shared hit=1437 dirtied=1026 -\u0026gt; Bitmap Index Scan on idx_tab1_a (cost=0.00..4.39 rows=14 width=0) (actual time=0.152..0.153 rows=1431 loops=1) Index Cond: (a = 1) Buffers: shared hit=6 Planning: Buffers: shared hit=5 Planning Time: 0.087 ms Execution Time: 2.614 ms When querying by primary key, shared hit is 1437, roughly matching the ~1430 table pages. Since indexes lack version information and the dead status of index entries hasn\u0026rsquo;t been updated, PostgreSQL follows all live index entries to find version information in the data pages. 
This is why a primary key index scan can be extremely slow.\nkill index item # Since indexes don\u0026rsquo;t store visibility information (i.e., MVCC version info), the visibility of the tuple pointed to by an index determines the index visibility itself. This is also why index-only scans in PostgreSQL still access data pages. Of course, with the visibility map (VM), the VM records which data pages are all-visible and all-frozen, so index-only scans won\u0026rsquo;t access those pages — they\u0026rsquo;re already visible.\nEven without VACUUM, the PostgreSQL kernel has a method for handling this kind of index bloat — kill index item. This feature is sometimes called Simple deletion or index deletion (terminology from src/backend/access/nbtree/README). Essentially, it marks index entries corresponding to tuples that are already LP_DEAD as dead, without changing the existing index structure.\nSource code function _bt_killitems:\n* _bt_killitems - set LP_DEAD state for items an indexscan caller has * told us were killed This clearly states that index scans trigger kill item operations (meaning SELECT can also trigger this operation to update the index). This is easy to test. 
Since our previous data has already been index-scanned, let\u0026rsquo;s rebuild data for testing.\ncreate table tab2(a bigserial,b char(100)); create index idx_tab2_a on tab2(a); create index idx_tab2_b on tab2(b); alter table tab2 set (autovacuum_enabled = off); --disable autovacuum alter table tab2 alter column b set storage PLAIN; --disable toast --Insert 1 row and update repeatedly insert into tab2(b) values(\u0026#39;00000\u0026#39;); DO $$ begin FOR i IN 1..10000 LOOP update tab2 set b=i::text; END LOOP; end $$; --Table pages lzldb=\u0026gt; select t_ctid,lp,case lp_flags when 0 then \u0026#39;LP_UNUSED\u0026#39; when 1 then \u0026#39;LP_NORMAL\u0026#39; when 2 then \u0026#39;LP_REDIRECT\u0026#39; when 3 then \u0026#39;LP_DEAD\u0026#39; end as lp_flags,t_xmin,t_xmax,t_field3 as t_cid, raw_flags, info.combined_flags from heap_page_items(get_raw_page(\u0026#39;tab2\u0026#39;,2)) item,LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2) info order by lp; t_ctid | lp | lp_flags | t_xmin | t_xmax | t_cid | raw_flags | combined_flags --------+----+-----------+--------+--------+-------+-----------------------------------------------+---------------- (2,2) | 1 | LP_NORMAL | 509 | 509 | 115 | {HEAP_HASVARWIDTH,HEAP_COMBOCID,HEAP_UPDATED} | {} (2,3) | 2 | LP_NORMAL | 509 | 509 | 116 | {HEAP_HASVARWIDTH,HEAP_COMBOCID,HEAP_UPDATED} | {} (2,4) | 3 | LP_NORMAL | 509 | 509 | 117 | {HEAP_HASVARWIDTH,HEAP_COMBOCID,HEAP_UPDATED} | {} (2,5) | 4 | LP_NORMAL | 509 | 509 | 118 | {HEAP_HASVARWIDTH,HEAP_COMBOCID,HEAP_UPDATED} | {} (2,6) | 5 | LP_NORMAL | 509 | 509 | 119 | {HEAP_HASVARWIDTH,HEAP_COMBOCID,HEAP_UPDATED} | {} (2,7) | 6 | LP_NORMAL | 509 | 509 | 120 | {HEAP_HASVARWIDTH,HEAP_COMBOCID,HEAP_UPDATED} | {} (2,8) | 7 | LP_NORMAL | 509 | 509 | 121 | {HEAP_HASVARWIDTH,HEAP_COMBOCID,HEAP_UPDATED} | {} ... 
--Index a pages lzldb=\u0026gt; SELECT itemoffset, ctid, itemlen, nulls, vars, dead, htid, tids[0:2] AS some_tids FROM bt_page_items(\u0026#39;idx_tab2_a\u0026#39;,4); itemoffset | ctid | itemlen | nulls | vars | dead | htid | some_tids ------------+-----------+---------+-------+------+------+---------+----------------------- 1 | (66,4097) | 24 | f | f | | (66,6) | 2 | (16,8414) | 1352 | f | f | f | (44,5) | {\u0026#34;(44,5)\u0026#34;,\u0026#34;(44,6)\u0026#34;} 3 | (16,8414) | 1352 | f | f | f | (47,53) | {\u0026#34;(47,53)\u0026#34;,\u0026#34;(47,54)\u0026#34;} 4 | (16,8414) | 1352 | f | f | f | (51,43) | {\u0026#34;(51,43)\u0026#34;,\u0026#34;(51,44)\u0026#34;} 5 | (16,8414) | 1352 | f | f | f | (55,33) | {\u0026#34;(55,33)\u0026#34;,\u0026#34;(55,34)\u0026#34;} 6 | (16,8414) | 1352 | f | f | f | (59,23) | {\u0026#34;(59,23)\u0026#34;,\u0026#34;(59,24)\u0026#34;} 7 | (16,8360) | 1024 | f | f | f | (63,13) | {\u0026#34;(63,13)\u0026#34;,\u0026#34;(63,14)\u0026#34;} --Index b pages lzldb=\u0026gt; SELECT itemoffset, ctid, itemlen, nulls, vars, dead, htid, tids[0:2] AS some_tids FROM bt_page_items(\u0026#39;idx_tab2_b\u0026#39;,4); itemoffset | ctid | itemlen | nulls | vars | dead | htid | some_tids ------------+---------+---------+-------+------+------+---------+----------- 1 | (57,1) | 112 | f | t | | | 2 | (0,34) | 112 | f | t | f | (0,34) | 3 | (5,41) | 112 | f | t | f | (5,41) | 4 | (56,53) | 112 | f | t | f | (56,53) | 5 | (56,54) | 112 | f | t | f | (56,54) | 6 | (56,55) | 112 | f | t | f | (56,55) | 7 | (56,56) | 112 | f | t | f | (56,56) | 8 | (56,57) | 112 | f | t | f | (56,57) | Now query the table with a sequential scan, then examine the data tuple and index entry states:\nlzldb=\u0026gt; explain (analyze,buffers) select * from tab2; QUERY PLAN ----------------------------------------------------------------------------------------------------- Seq Scan on tab2 (cost=0.00..204.14 rows=3114 width=412) (actual time=1.077..1.079 rows=1 loops=1) Buffers: 
shared hit=173 dirtied=173 Planning Time: 0.042 ms Execution Time: 1.090 ms lzldb=\u0026gt; select t_ctid,lp,case lp_flags when 0 then \u0026#39;LP_UNUSED\u0026#39; when 1 then \u0026#39;LP_NORMAL\u0026#39; when 2 then \u0026#39;LP_REDIRECT\u0026#39; when 3 then \u0026#39;LP_DEAD\u0026#39; end as lp_flags,t_xmin,t_xmax,t_field3 as t_cid, raw_flags, info.combined_flags from heap_page_items(get_raw_page(\u0026#39;tab2\u0026#39;,4)) item,LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2) info order by lp; t_ctid | lp | lp_flags | t_xmin | t_xmax | t_cid | raw_flags | combined_flags --------+----+----------+--------+--------+-------+-----------+---------------- | 1 | LP_DEAD | | | | | | 2 | LP_DEAD | | | | | | 3 | LP_DEAD | | | | | | 4 | LP_DEAD | | | | | | 5 | LP_DEAD | | | | | | 6 | LP_DEAD | | | | | | 7 | LP_DEAD | | | | | lzldb=\u0026gt; SELECT itemoffset, ctid, itemlen, nulls, vars, dead, htid, tids[0:2] AS some_tids FROM bt_page_items(\u0026#39;idx_tab2_a\u0026#39;,4); itemoffset | ctid | itemlen | nulls | vars | dead | htid | some_tids ------------+-----------+---------+-------+------+------+---------+----------------------- 1 | (66,4097) | 24 | f | f | | (66,6) | 2 | (16,8414) | 1352 | f | f | f | (44,5) | {\u0026#34;(44,5)\u0026#34;,\u0026#34;(44,6)\u0026#34;} 3 | (16,8414) | 1352 | f | f | f | (47,53) | {\u0026#34;(47,53)\u0026#34;,\u0026#34;(47,54)\u0026#34;} 4 | (16,8414) | 1352 | f | f | f | (51,43) | {\u0026#34;(51,43)\u0026#34;,\u0026#34;(51,44)\u0026#34;} 5 | (16,8414) | 1352 | f | f | f | (55,33) | {\u0026#34;(55,33)\u0026#34;,\u0026#34;(55,34)\u0026#34;} 6 | (16,8414) | 1352 | f | f | f | (59,23) | {\u0026#34;(59,23)\u0026#34;,\u0026#34;(59,24)\u0026#34;} 7 | (16,8360) | 1024 | f | f | f | (63,13) | {\u0026#34;(63,13)\u0026#34;,\u0026#34;(63,14)\u0026#34;} (7 rows) lzldb=\u0026gt; SELECT itemoffset, ctid, itemlen, nulls, vars, dead, htid, tids[0:2] AS some_tids FROM bt_page_items(\u0026#39;idx_tab2_b\u0026#39;,4); itemoffset | ctid | itemlen | 
nulls | vars | dead | htid | some_tids ------------+---------+---------+-------+------+------+---------+----------- 1 | (57,1) | 112 | f | t | | | 2 | (0,34) | 112 | f | t | f | (0,34) | 3 | (5,41) | 112 | f | t | f | (5,41) | 4 | (56,53) | 112 | f | t | f | (56,53) | 5 | (56,54) | 112 | f | t | f | (56,54) | 6 | (56,55) | 112 | f | t | f | (56,55) | 7 | (56,56) | 112 | f | t | f | (56,56) | Data tuples: all pages except the last were marked LP_DEAD. Index entries: nothing changed.\nNow query again using index a:\nlzldb=\u0026gt; explain (analyze,buffers) select * from tab2 where a=1; QUERY PLAN --------------------------------------------------------------------------------------------------------------------- Index Scan using idx_tab2_a on tab2 (cost=0.28..68.56 rows=16 width=412) (actual time=1.282..1.510 rows=1 loops=1) Index Cond: (a = 1) Buffers: shared hit=190 dirtied=8 Planning Time: 0.058 ms Execution Time: 1.525 ms (5 rows) lzldb=\u0026gt; select t_ctid,lp,case lp_flags when 0 then \u0026#39;LP_UNUSED\u0026#39; when 1 then \u0026#39;LP_NORMAL\u0026#39; when 2 then \u0026#39;LP_REDIRECT\u0026#39; when 3 then \u0026#39;LP_DEAD\u0026#39; end as lp_flags,t_xmin,t_xmax,t_field3 as t_cid, raw_flags, info.combined_flags from heap_page_items(get_raw_page(\u0026#39;tab2\u0026#39;,0)) item,LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2) info order by lp; t_ctid | lp | lp_flags | t_xmin | t_xmax | t_cid | raw_flags | combined_flags --------+----+----------+--------+--------+-------+-----------+---------------- | 1 | LP_DEAD | | | | | | 2 | LP_DEAD | | | | | | 3 | LP_DEAD | | | | | | 4 | LP_DEAD | | | | | | 5 | LP_DEAD | | | | | | 6 | LP_DEAD | | | | | lzldb=\u0026gt; SELECT itemoffset, ctid, itemlen, nulls, vars, dead, htid, tids[0:2] AS some_tids FROM bt_page_items(\u0026#39;idx_tab2_a\u0026#39;,4); itemoffset | ctid | itemlen | nulls | vars | dead | htid | some_tids 
------------+-----------+---------+-------+------+------+---------+----------------------- 1 | (66,4097) | 24 | f | f | | (66,6) | 2 | (16,8414) | 1352 | f | f | t | (44,5) | {\u0026#34;(44,5)\u0026#34;,\u0026#34;(44,6)\u0026#34;} 3 | (16,8414) | 1352 | f | f | t | (47,53) | {\u0026#34;(47,53)\u0026#34;,\u0026#34;(47,54)\u0026#34;} 4 | (16,8414) | 1352 | f | f | t | (51,43) | {\u0026#34;(51,43)\u0026#34;,\u0026#34;(51,44)\u0026#34;} 5 | (16,8414) | 1352 | f | f | t | (55,33) | {\u0026#34;(55,33)\u0026#34;,\u0026#34;(55,34)\u0026#34;} 6 | (16,8414) | 1352 | f | f | t | (59,23) | {\u0026#34;(59,23)\u0026#34;,\u0026#34;(59,24)\u0026#34;} 7 | (16,8360) | 1024 | f | f | t | (63,13) | {\u0026#34;(63,13)\u0026#34;,\u0026#34;(63,14)\u0026#34;} (7 rows) lzldb=\u0026gt; SELECT itemoffset, ctid, itemlen, nulls, vars, dead, htid, tids[0:2] AS some_tids FROM bt_page_items(\u0026#39;idx_tab2_b\u0026#39;,4); itemoffset | ctid | itemlen | nulls | vars | dead | htid | some_tids ------------+---------+---------+-------+------+------+---------+----------- 1 | (57,1) | 112 | f | t | | | 2 | (0,34) | 112 | f | t | f | (0,34) | 3 | (5,41) | 112 | f | t | f | (5,41) | 4 | (56,53) | 112 | f | t | f | (56,53) | 5 | (56,54) | 112 | f | t | f | (56,54) | 6 | (56,55) | 112 | f | t | f | (56,55) | 7 | (56,56) | 112 | f | t | f | (56,56) | The dead tuples in index a have all been marked dead=t, while dead tuples in index b remain dead=f because we haven\u0026rsquo;t scanned index b.\nNow query through index a again:\nlzldb=\u0026gt; explain (analyze,buffers) select * from tab2 where a=1; QUERY PLAN --------------------------------------------------------------------------------------------------------------------- Index Scan using idx_tab2_a on tab2 (cost=0.28..68.56 rows=16 width=412) (actual time=0.020..0.021 rows=1 loops=1) Index Cond: (a = 1) Buffers: shared hit=10 Planning Time: 0.059 ms Execution Time: 0.033 ms Because the index entries for dead tuples in index a have been marked dead=t, 
there\u0026rsquo;s no need to check version information on data pages to determine whether tuples are \u0026ldquo;alive.\u0026rdquo;\nWhy is shared hit=10 here, still somewhat high? Because kill index item only marks dead index entries without changing the index structure, so the number of index pages hasn\u0026rsquo;t decreased. These 10 shared hits correspond to 10 index pages (including the meta page).\nlzldb=\u0026gt; analyze tab2; ANALYZE lzldb=\u0026gt; select relname,relpages,reltuples from pg_class where relname=\u0026#39;idx_tab2_a\u0026#39;; relname | relpages | reltuples ------------+----------+----------- idx_tab2_a | 10 | 1 Bottom-Up deletion # In PG14, the trigger condition for index deletion was enhanced. As mentioned earlier, index deletion is triggered by scanning the index. In PG14, index deletion can also be triggered when an index page split is imminent, to find free index space and reduce the probability of page splits.\nThis feature reduces index splits and thus also reduces index bloat, mitigating the problems caused by index bloat.\nFor specific testing, see: INDEX BLOAT REDUCED IN POSTGRESQL V14\nindex deduplication # PG13 introduced the index deduplication feature, which brings the GIN index posting list concept into btree indexes to reduce the space occupied by duplicate btree index entries and mitigate index split issues.\nPreviously, btree index entries pointed to only one ctid (as we saw in the tests above). 
With deduplicate index items, one index entry can have a posting list, and one posting list can hold multiple ctids.\nThe representation of posting lists is almost identical to the posting lists used by GIN\nLike GIN posting tree(list) (the btree posting list may not exactly follow this structure — needs further study):\n（https://postgrespro.com/blog/pgsql/4261647）\nTesting index deduplication:\ncreate table tab3(same char(100),diff char(100)); create index idx_tab3_same on tab3(same); create index idx_tab3_diff on tab3(diff); insert into tab3 select 10000::text,i::text from generate_series(10000, 99999) as i; lzldb=\u0026gt; SELECT itemoffset, ctid, itemlen, nulls, vars, dead, htid, tids[0:2] AS some_tids FROM bt_page_items(\u0026#39;idx_tab3_same\u0026#39;,4); itemoffset | ctid | itemlen | nulls | vars | dead | htid | some_tids ------------+------------+---------+-------+------+------+----------+----------------------- 1 | (104,4097) | 120 | f | t | | (104,10) | 2 | (112,8398) | 1352 | f | t | f | (69,19) | {\u0026#34;(69,19)\u0026#34;,\u0026#34;(69,20)\u0026#34;} 3 | (112,8398) | 1352 | f | t | f | (75,21) | {\u0026#34;(75,21)\u0026#34;,\u0026#34;(75,22)\u0026#34;} 4 | (112,8398) | 1352 | f | t | f | (81,23) | {\u0026#34;(81,23)\u0026#34;,\u0026#34;(81,24)\u0026#34;} 5 | (112,8398) | 1352 | f | t | f | (87,25) | {\u0026#34;(87,25)\u0026#34;,\u0026#34;(87,26)\u0026#34;} 6 | (112,8398) | 1352 | f | t | f | (93,27) | {\u0026#34;(93,27)\u0026#34;,\u0026#34;(93,28)\u0026#34;} 7 | (112,8344) | 1024 | f | t | f | (99,29) | {\u0026#34;(99,29)\u0026#34;,\u0026#34;(99,30)\u0026#34;} (7 rows) lzldb=\u0026gt; SELECT itemoffset, ctid, itemlen, nulls, vars, dead, htid, tids[0:2] AS some_tids FROM bt_page_items(\u0026#39;idx_tab3_diff\u0026#39;,4); itemoffset | ctid | itemlen | nulls | vars | dead | htid | some_tids ------------+--------+---------+-------+------+------+--------+----------- 1 | (5,1) | 112 | f | t | | | 2 | (3,23) | 112 | f | t | f | (3,23) | 3 | (3,24) | 112 | 
f | t | f | (3,24) | ... 62 | (5,15) | 112 | f | t | f | (5,15) | 63 | (5,16) | 112 | f | t | f | (5,16) | (63 rows) The tids column in the bt_page_items function is essentially the posting list. The same field was inserted with identical data and produced deduplication in the index; the diff field had no duplicate data and produced no deduplication.\nThe space difference is enormous:\nlzldb=\u0026gt; select relname,relpages,reltuples from pg_class where relname like \u0026#39;idx_tab3%\u0026#39;; relname | relpages | reltuples ---------------+----------+----------- idx_tab3_diff | 1484 | 90000 idx_tab3_same | 81 | 90000 Can unique indexes produce deduplication? # Unique indexes have no duplicate data, so it seems like they wouldn\u0026rsquo;t. In practice, they can. Because even with unique indexes, when HOT can\u0026rsquo;t satisfy an update, multiple index entries are created. We can see this from the first test case in this article. Repeatedly updating a single row with UPDATE also produces deduplication, which occurs before delete index item.\nAdditionally, when delete index item removes a posting list index entry, it must ensure that all ctids under the posting list correspond to DEAD tuples.\nDisabling deduplication # Index deduplication was introduced in PG13. The feature is enabled by default and can be disabled at the index level. Modifying deduplicate_items on an index won\u0026rsquo;t directly change the existing index structure; it only affects newly inserted data.\nalter index idx_tab3_same set (deduplicate_items=off); create index idx_tab3_same1 on tab3(same) with (deduplicate_items=off); What does VACUUM do? # VACUUM does many things. Here we\u0026rsquo;ll only focus on table/index bloat and space reclamation, skipping wraparound and other topics.\nLet\u0026rsquo;s test with tab2, where we repeatedly updated a single row. 
Simple deletion has already been triggered, and table/index entries are almost all DEAD.\nRun VACUUM directly:\nlzldb=# vacuum verbose tab2; INFO: vacuuming \u0026#34;public.tab2\u0026#34; INFO: scanned index \u0026#34;idx_tab2_a\u0026#34; to remove 10000 row versions DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s INFO: scanned index \u0026#34;idx_tab2_b\u0026#34; to remove 10000 row versions DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s INFO: \u0026#34;tab2\u0026#34;: removed 10000 row versions in 173 pages DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s INFO: index \u0026#34;idx_tab2_a\u0026#34; now contains 1 row versions in 10 pages DETAIL: 10000 index row versions were removed. 7 index pages have been deleted, 0 are currently reusable. CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s. INFO: index \u0026#34;idx_tab2_b\u0026#34; now contains 1 row versions in 276 pages DETAIL: 10000 index row versions were removed. 269 index pages have been deleted, 0 are currently reusable. CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s. INFO: \u0026#34;tab2\u0026#34;: found 24 removable, 1 nonremovable row versions in 173 out of 173 pages DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 526 There were 0 unused item identifiers. Skipped 0 pages due to buffer pins, 0 frozen pages. 0 pages are entirely empty. CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s. VACUUM idx_tab2_a removed 10000 row versions in 10 pages, 7 index pages were deleted. 
Table tab2 removed 10000 row versions in 173 pages.\n--First page of the table lzldb=\u0026gt; select t_ctid,lp,case lp_flags when 0 then \u0026#39;LP_UNUSED\u0026#39; when 1 then \u0026#39;LP_NORMAL\u0026#39; when 2 then \u0026#39;LP_REDIRECT\u0026#39; when 3 then \u0026#39;LP_DEAD\u0026#39; end as lp_flags,t_xmin,t_xmax,t_field3 as t_cid, raw_flags, info.combined_flags from heap_page_items(get_raw_page(\u0026#39;tab2\u0026#39;,0)) item,LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2) info order by lp; t_ctid | lp | lp_flags | t_xmin | t_xmax | t_cid | raw_flags | combined_flags --------+----+-----------+--------+--------+-------+-----------+---------------- | 1 | LP_UNUSED | | | | | | 2 | LP_UNUSED | | | | | ... | 45 | LP_UNUSED | | | | | --Last page of the table lzldb=\u0026gt; select t_ctid,lp,case lp_flags when 0 then \u0026#39;LP_UNUSED\u0026#39; when 1 then \u0026#39;LP_NORMAL\u0026#39; when 2 then \u0026#39;LP_REDIRECT\u0026#39; when 3 then \u0026#39;LP_DEAD\u0026#39; end as lp_flags,t_xmin,t_xmax,t_field3 as t_cid, raw_flags, info.combined_flags from heap_page_items(get_raw_page(\u0026#39;tab2\u0026#39;,172)) item,LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2) info order by lp; t_ctid | lp | lp_flags | t_xmin | t_xmax | t_cid | raw_flags | combined_flags ----------+----+-----------+--------+--------+-------+-----------------------------------------------------------------------+---------------- | 1 | LP_UNUSED | | | | | | 2 | LP_UNUSED | | | | | ... 
| 23 | LP_UNUSED | | | | | | 24 | LP_UNUSED | | | | | (172,25) | 25 | LP_NORMAL | 509 | 0 | 9999 | {HEAP_HASVARWIDTH,HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID,HEAP_UPDATED} | {} --First index page lzldb=\u0026gt; SELECT itemoffset, ctid, itemlen, nulls, vars, dead, htid, tids[0:2] AS some_tids FROM bt_page_items(\u0026#39;idx_tab2_a\u0026#39;,1); NOTICE: page is deleted itemoffset | ctid | itemlen | nulls | vars | dead | htid | some_tids ------------+----------------+---------+-------+------+------+------+----------- 1 | (4294967295,0) | 8 | f | f | | | (1 row) --Last index page lzldb=\u0026gt; SELECT itemoffset, ctid, itemlen, nulls, vars, dead, htid, tids[0:2] AS some_tids FROM bt_page_items(\u0026#39;idx_tab2_a\u0026#39;,9); itemoffset | ctid | itemlen | nulls | vars | dead | htid | some_tids ------------+----------+---------+-------+------+------+----------+----------- 1 | (172,25) | 16 | f | f | f | (172,25) | All line pointers for dead table tuples were marked UNUSED, their data was removed, and only one live tuple remains in NORMAL state. The table still has the same number of pages.\nAll dead index entries (dead=t) were removed. Live index entries were shifted within index pages (the last page\u0026rsquo;s index entry originally had itemoffset != 1). All emptied index pages were marked as deleted. These deleted index pages still exist on disk: they have been unlinked from the tree but not yet recycled, as the NOTICE above shows.\nFrom the nbtree README on \u0026ldquo;Deleting entire pages during VACUUM\u0026rdquo; (the original is quite long; I\u0026rsquo;ve excerpted the key parts):\nWe consider deleting an entire page from the btree only when it\u0026rsquo;s become completely empty of items. Page deletion always begins from an empty leaf page. An internal page can only be deleted as part of deleting an entire subtree.\nAn entire page is only considered for deletion when the index page is completely empty. 
Deletion always starts from leaf nodes; non-leaf nodes are only deleted when deleting an entire subtree.\nDeleting a leaf page is a two-stage process.\nIn the first stage, the page is unlinked from its parent, and marked as half-dead. In the second-stage, the half-dead leaf page is unlinked from its siblings. We first lock the left sibling (if any) of the target, the target page itself, and its right sibling (there must be one) in that order. Then we update the side-links in the siblings, and mark the target page deleted.\nDeleting a leaf page has two stages:\nUnlink from the parent — the leaf page is now in half-dead state Unlink from left and right siblings — the leaf page is now in deleted state A deleted page cannot be recycled immediately, since there may be other processes waiting to reference it (ie, search processes that just left the parent, or scans moving right or left from one of the siblings). These processes must be able to observe a deleted page for some time after the deletion operation, in order to be able to at least recover from it (they recover by moving right, as with concurrent page splits). 
Searchers never have to worry about concurrent page recycling.\nBecause other processes may still be using the deleted page, VACUUM cannot immediately recycle these index pages.\nThis description matches what we observed.\nAfter VACUUM, the index still has the same number of pages:\nrelname | relpages | reltuples ------------+----------+----------- idx_tab2_a | 10 | 1 tab2 | 173 | 1 Even so, the index scan no longer needs to access the deleted pages:\nlzldb=\u0026gt; explain (analyze,buffers) select * from tab2 where a=1; QUERY PLAN ------------------------------------------------------------------------------------------------------------------- Index Scan using idx_tab2_a on tab2 (cost=0.12..8.14 rows=1 width=109) (actual time=0.011..0.012 rows=1 loops=1) Index Cond: (a = 1) Buffers: shared hit=2 Planning Time: 0.056 ms Execution Time: 0.025 ms Before VACUUM, shared hit=10. After VACUUM, the number of index pages hasn\u0026rsquo;t changed: still 10, with 7 of them deleted (per the VACUUM VERBOSE output above) but not yet recycled, so the scan skips them and shared hit drops to 2. Why 2 is easy to understand: \u0026ldquo;meta page\u0026rdquo; + \u0026ldquo;the one surviving leaf page.\u0026rdquo;\nPlacing deleted pages in the FSM # Recycling a page is decoupled from page deletion. A deleted page can only be put in the FSM to be recycled once there is no possible scan or search that has a reference to it; until then, it must stay in place with its sibling links undisturbed, as a tombstone that allows concurrent searches to detect and then recover from concurrent deletions (which are rather like concurrent page splits to searchers)\nWhat is \u0026ldquo;Placing deleted pages in the FSM\u0026rdquo;? After an index page is deleted, it isn\u0026rsquo;t directly recycled. During index splits or new page allocation, it\u0026rsquo;s hard to find deleted pages for reuse. 
Placing deleted pages in the FSM puts these recyclable pages into the index\u0026rsquo;s corresponding FSM file, making it easy to find available free pages.\nAs mentioned earlier, during the first VACUUM, those deleted pages are unlinked but still occupy space. Before PG14:\nWe implement the technique by waiting until all active snapshots and registered snapshots as of the page deletion are gone\nOne condition for recycling: every snapshot that was active or registered when the page was deleted must be gone. Long-running transactions therefore delay placing pages in the FSM.\nPlacing an already-deleted page in the FSM to be recycled when needed doesn\u0026rsquo;t actually change the state of the page. The page will be changed whenever it is subsequently taken from the FSM for reuse. The deleted page\u0026rsquo;s contents will be overwritten by the split operation (it will become the new right sibling page).\nAdditionally, putting an already-deleted page into the FSM file doesn\u0026rsquo;t change the page\u0026rsquo;s state; it only makes free pages quick to locate.\nPrior to PostgreSQL 14, VACUUM would only place old deleted pages that it encounters during its linear scan (pages deleted by a previous VACUUM operation) in the FSM. Newly deleted pages were never placed in the FSM, because that was assumed to always be unsafe. PostgreSQL 14 added the ability for VACUUM to consider if it\u0026rsquo;s possible to recycle newly deleted pages at the end of the full index scan where the page deletion took place\nBefore PG14, deleted pages produced by the first VACUUM were not placed in the FSM. Only \u0026ldquo;old\u0026rdquo; deleted pages would be placed in the FSM file. Starting from PG14, the first VACUUM also considers placing deleted pages in the FSM.\nTest (my version is PG13):\nThe tab2 test above ran only one VACUUM. 
Although deleted pages were produced, the index has no corresponding FSM file:\nlzldb=\u0026gt; select * from pg_relation_filepath(\u0026#39;idx_tab2_a\u0026#39;); pg_relation_filepath ---------------------- base/16384/16437 [postgres@lzlhost data]$ ll base/16384/16437* -rw------- 1 postgres postgres 81920 Apr 5 11:04 base/16384/16437 Now run VACUUM again:\nlzldb=\u0026gt; vacuum tab2; [postgres@lzlhost data]$ ll base/16384/16437* -rw------- 1 postgres postgres 81920 Apr 5 11:04 base/16384/16437 -rw------- 1 postgres postgres 24576 Apr 5 15:52 base/16384/16437_fsm The index immediately generated an FSM file.\nFlowchart: Index Bloat and Cleanup # Please note:\nThe diagram below does not include table FSM/VM information The diagram below does not include deduplication information Version is PG13 fillfactor # Above we covered various kernel-supported methods for reducing index bloat. Beyond these approaches that require little active participation, you can also adjust table and index fillfactor to control bloat.\nFillfactor is essentially the waterline for tables or indexes. When INSERTING data, once the page reaches the fillfactor line, insertion moves to the next page. Fillfactor is designed to leave room for UPDATE operations, preventing UPDATE from frequently seeking new pages.\nAlthough both tables and indexes have fillfactor with the same goal (accommodating UPDATE), the details differ significantly:\nTables: If a table page still has space, UPDATE can happen within that page without needing to request a new page or go to another page with free space. Moreover, due to PostgreSQL\u0026rsquo;s unique HOT feature, in-page updates don\u0026rsquo;t update indexes, which naturally slows index bloat. Indexes: Different data rows or cross-page updates to the same row generate new index entries. Fillfactor leaves headroom in index pages, greatly reducing index split problems. Of course, fillfactor settings are closely tied to your workload. 
If data is like logs — monotonically increasing with zero updates — then setting both table and index fillfactor to 100 is reasonable. But most production tables have updates, and table/index fillfactor should not be 100. For frequent UPDATE workloads, fillfactor should be set even lower.\nHowever, PostgreSQL\u0026rsquo;s default fillfactor values are:\nTable default fillfactor=100 Index default fillfactor=90 With table fillfactor=100, HOT is completely unusable! Any UPDATE immediately seeks a new data page and creates a new index entry in the index\u0026rsquo;s 10% headroom. Eventually, update-heavy workloads constantly update indexes, and even 90 fillfactor on the index can\u0026rsquo;t hold up, leading to index splits\u0026hellip;\nHere\u0026rsquo;s a fillfactor test — two tables differ only in fillfactor, updating the same amount of data, comparing the final shared hit difference:\ncreate table tab4(a bigserial,b char(100)); create index idx_tab4_a on tab4(a); alter index idx_tab4_a set (deduplicate_items=off); --disable index deduplication alter table tab4 alter column b set storage PLAIN; --disable toast alter table tab4 set (autovacuum_enabled = off); --disable autovacuum --tab5 has the same definition as tab4, except table and index fillfactor are adjusted alter table tab5 set (fillfactor=70); alter index idx_tab5_a set (fillfactor=80); insert into tab4(b) values(\u0026#39;lllllllllll\u0026#39;); --Repeatedly update one row DO $$ begin FOR i IN 1..10000 LOOP update tab4 set b=md5(i::text) where a=1; END LOOP; end $$;; --Primary key query with default fillfactor lzldb=\u0026gt; explain (analyze,buffers) select * from tab4 where a=1; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on tab4 (cost=4.28..53.88 rows=16 width=412) (actual time=0.894..0.895 rows=1 loops=1) Recheck Cond: (a = 1) Heap Blocks: exact=1 Buffers: shared hit=174 -\u0026gt; Bitmap Index Scan 
on idx_tab4_a (cost=0.00..4.28 rows=16 width=0) (actual time=0.023..0.023 rows=173 loops=1) Index Cond: (a = 1) Buffers: shared hit=1 Planning Time: 0.057 ms Execution Time: 0.913 ms (9 rows) --Primary key query with lowered fillfactor lzldb=\u0026gt; explain (analyze,buffers) select * from tab5 where a=1; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on tab5 (cost=4.39..56.41 rows=14 width=4012) (actual time=3.367..3.369 rows=1 loops=1) Recheck Cond: (a = 1) Heap Blocks: exact=1 Buffers: shared hit=1434 -\u0026gt; Bitmap Index Scan on idx_tab5_a (cost=0.00..4.39 rows=14 width=0) (actual time=0.195..0.195 rows=1429 loops=1) Index Cond: (a = 1) Buffers: shared hit=5 Planning Time: 0.059 ms Execution Time: 3.390 ms After lowering fillfactor, the reduction in shared hits is very significant, and Execution Time improves several times over. In fact, both data pages and index pages decreased.\nSo, on update-heavy production tables, lowering table and index fillfactor can mitigate bloat problems.\nSummary # Although index bloat always accompanies table bloat, their principles differ. 
HOT doesn\u0026rsquo;t update index entries; cross-page updates create new index entries.\nLowering table and index fillfactor can slow bloat in update-heavy production tables, which in turn mitigates slow SQL such as primary key lookups.\nThere are also several kernel-level features for improving index space efficiency:\nCleaning dead index entries during index scans (index tuple deletion) Cleaning dead index entries during index splits (Bottom-Up index tuple deletion) Vacuum marking pages of entirely dead index entries (Deleting entire pages during VACUUM) Quickly locating recycled index pages during index splits (Placing deleted pages in the FSM) references # src/backend/access/nbtree/README https://mp.weixin.qq.com/s/GBN7dFQU72BfzvLSzlLmYA pg事务：事务相关元组结构 https://www.cybertec-postgresql.com/en/killed-index-tuples/ https://www.cybertec-postgresql.com/en/index-bloat-reduced-in-postgresql-v14/?spm=a2c6h.12873639.article-detail.8.2f153438mIV8JK https://www.cybertec-postgresql.com/en/b-tree-index-improvements-in-postgresql-v12/ https://www.cybertec-postgresql.com/en/b-tree-index-deduplication/\n","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/2024/08/12/from-extremely-slow-unique-index-scan-to-index-bloat/","section":"Posts","summary":"How Did a Primary Key Query Access Multiple Data Pages? # Continuing from the previous article: A Classic Case of Long Transactions, Table Bloat, and LIMIT Problems, there was one point not explained in detail:\nWhy does a query using the primary key generate so many shared hits? Why does index bloat cause access to multiple data pages? Can’t data outside the page be located through the corresponding index entry? This relates to index version management — indexes do carry some version information, but not much. 
Let’s first review PostgreSQL’s btree index structure.\n","title":"From Extremely Slow Unique Index Scan to Index Bloat","type":"posts"},{"content":" A Brief Introduction to HikariCP # \u0026ldquo;Hikari\u0026rdquo; means \u0026ldquo;light\u0026rdquo; in Japanese, and HikariCP aims to be a connection pool as fast as light. Written almost entirely in Java, this connection pool is extremely lightweight and performance-focused. HikariCP is now the default connection pool for Spring Boot, and with the proliferation of Spring Boot and microservices, HikariCP usage continues to grow.\nOn the HikariCP GitHub homepage, there\u0026rsquo;s a performance comparison: (https://github.com/brettwooldridge/HikariCP-benchmark)\nIt appears to crush all other database connection pools. However, this performance comparison is somewhat dated and lacks a comparison with Druid, Alibaba\u0026rsquo;s flagship home-grown connection pool. I briefly checked Druid\u0026rsquo;s GitHub page, and it actually has slightly more stars than HikariCP. Druid is clearly stronger in terms of functionality. As for which performs better, the question even sparked a spat between experts, and I haven\u0026rsquo;t seen any rigorous performance comparison report yet. But that\u0026rsquo;s not the focus of this article\u0026hellip; this article is just to get a basic understanding of HikariCP.\nKey Connection Pool Parameters # There aren\u0026rsquo;t that many parameters. Let\u0026rsquo;s pick the important ones:\nParameter Meaning minimumIdle This property controls the minimum number of idle connections HikariCP tries to maintain in the pool. If the number of idle connections drops below this value and the total number of connections in the pool is less than maximumPoolSize, HikariCP will do its best to quickly and efficiently add additional connections. 
However, for maximum performance and responsiveness to peak demand, we recommend not setting this value and instead letting HikariCP act as a fixed-size connection pool. Default: same as maximumPoolSize. maximumPoolSize This property controls the maximum size the pool can reach, including both idle and in-use connections. Basically, this value determines the upper limit of actual connections to the database backend. A reasonable value is best determined by your execution environment. When the pool reaches this size and no idle connections are available, calls to getConnection() will block until timeout after connectionTimeout milliseconds. Default: 10 maxLifetime This property controls the maximum lifetime of connections in the pool. A connection in use will never be retired — it is only removed when closed. To avoid mass connection eviction in the pool, this property applies a slight negative attenuation to each connection. We strongly recommend setting this value, and it should be a few seconds shorter than any database or infrastructure-imposed connection time limit. A value of 0 means no maximum lifetime (infinite lifetime), subject to idleTimeout constraints. Minimum allowed: 30000ms (30 seconds). Default: 1800000 (30 minutes). idleTimeout This property controls the maximum time a connection is allowed to sit idle in the pool. This setting only applies when minimumIdle is defined as less than maximumPoolSize. Once the pool reaches minimumIdle connections, idle connections are not retired. Whether a connection is considered idle and retired has a maximum variation of +30 seconds, with an average variation of +15 seconds. A connection is never considered idle and retired before this timeout. A value of 0 means idle connections are never removed from the pool. Minimum allowed: 10000ms (10 seconds). Default: 600000 (10 minutes). 
keepaliveTime This property controls how frequently HikariCP will attempt to keep a connection alive to prevent it from timing out due to database or network infrastructure. This value must be less than maxLifetime. The \u0026ldquo;keepalive\u0026rdquo; operation only occurs on idle connections. Minimum allowed: 30000ms (30 seconds), but the ideal value is in the range of a few minutes. Default: 0 (disabled). The keepaliveTime parameter should be set lower than the database idle connection timeout, TCP idle connection timeout, and all other infrastructure idle timeouts. For PostgreSQL, HikariCP\u0026rsquo;s keepaliveTime should be set to less than PG\u0026rsquo;s idle_session_timeout (PostgreSQL 14 and later), which terminates sessions that sit idle outside a transaction; idle_in_transaction_session_timeout only covers sessions that are idle inside an open transaction.\nClearly, maximumPoolSize represents the maximum number of connections to the database. In general, the actual number of connections in the database won\u0026rsquo;t always stay at maximumPoolSize, because the application can\u0026rsquo;t run at peak load from start to finish. After a request peak passes, the idle connections should be released after some time according to the idleTimeout or maxLifetime settings. To ensure database availability, this value should be set somewhat lower than the database\u0026rsquo;s maximum connections. For PostgreSQL, the sum of maximumPoolSize across all application nodes should stay below PG\u0026rsquo;s max_connections. There\u0026rsquo;s room for tuning this parameter, which we\u0026rsquo;ll discuss below.\nminimumIdle is the minimum number of idle connections. For example, if minimumIdle=100 and the database has 10 active sessions, theoretically the total connections in the database should be 100+10 = 110. Due to possible connection storms, the actual database connections might be slightly more than active+minimumIdle, but never more than maximumPoolSize.\nWhy are database connections far greater than minimumIdle?\nTheoretically, total database connections should only be slightly more than minimumIdle. 
However, from my actual observation of multi-node connection pool scenarios, even with only 10+ active connections, total database connections far exceed minimumIdle. Looking at min(backend_start) and min(state_change) in pg_stat_activity, the age of the oldest session stays around maxLifetime, indicating that connection recycling is working. It seems new requests often establish new connections rather than reuse existing idle ones. Personally, I suspect multi-node deployment is one reason: each node has a low minimumIdle, and some component nodes may receive more requests, with instantaneous request counts exceeding minimumIdle, thus creating new connections. A second reason is the maxLifetime parameter: its purpose is to rotate connections, releasing those constantly in use. This means used connections need time to be released and ideally shouldn\u0026rsquo;t be reused, to avoid extending the release cycle.\nConnection Pool Sizing # Impact of Excessive Connections # In the database world, \u0026ldquo;as the number of database connections increases, database performance always degrades to some extent.\u0026rdquo;\nFor example, for the impact of Oracle\u0026rsquo;s connection count on performance, refer to this video. With unchanged resource configuration and JDBC concurrency, reducing connections from 2048 to 1024 halved the request response time; reducing to 96 connections cut response time by a factor of tens.\nWhat\u0026rsquo;s the Right Number of Connections? # Unless you have a database server that has 1000 cores, it is very unlikely that you really want a maximumPoolSize of 2000.\nAt the most basic level, the database connection count should be set to the number of CPU cores, which keeps every core busy without oversubscribing the CPU. But this isn\u0026rsquo;t the full picture. 
A database doesn\u0026rsquo;t consume only CPU, but also disk and network (memory too, but with relatively less impact). For example, disk reads/writes also take time, and the CPU must wait for disk data to return before proceeding. During I/O wait periods (which can be quite long), it\u0026rsquo;s better for the CPU not to be idle but to serve other processes. Therefore, based on waiting times for disk and other devices, the database connection count should ideally be higher than the number of CPU cores.\nDue to SSD and other disk performance improvements, disk access is now very fast, meaning I/O wait times have decreased, which implies connection counts should be tuned even lower.\nTuning too low fails to fully utilize CPU; tuning too high degrades database performance. So what\u0026rsquo;s the right number? The HikariCP wiki cites this formula, which comes from the PostgreSQL project, as a starting point:\nconnections = ((core_count * 2) + effective_spindle_count)\nHere core_count should not count hyperthreading, and effective_spindle_count is the spindle count: if the active dataset is fully cached, effective_spindle_count is zero; as cache hit rate decreases, it should approach the actual spindle count. For example, a 4-core server with one hard disk gets (4 * 2) + 1 = 9 connections. There\u0026rsquo;s no established formula for SSDs yet, but the result is certainly less than the above maximum. Of course, these are all theoretical values, and real-world situations are more complex, e.g., workloads that hold connections for a long time. See About Pool Sizing for details.\nEven with 10,000 frontend users, the connection pool cannot be 10,000; even 1,000 is too many. A smaller connection count, with remaining requests waiting in the pool queue, is the best way to maximize database and CPU performance. See the formula above for connection count settings.\nFixed Pool # Fixed pool is a concept advocated by HikariCP\u0026rsquo;s author Brett Wooldridge to solve the connection storm problem. 
The concept is already mentioned in the minimumIdle parameter description:\nFor maximum performance and responsiveness to peak demand, we recommend not setting minimumIdle and instead letting HikariCP act as a fixed-size connection pool. Default: same as maximumPoolSize.\nSetting minimumIdle=maximumPoolSize creates a fixed-size connection pool. minimumIdle\u0026rsquo;s default value equals maximumPoolSize.\nAs early as 2014, Brett Wooldridge mentioned this concept — see the PG community mailing list. This passage is important, so I\u0026rsquo;ll translate it verbatim:\nIn my experience, even pools that maintain a minimum number of idle connections are problematic in responding to burst demand. If you have a pool with a maximum of 30 connections and a target of 10 minimum idle connections, a burst demand requiring 20 connections means the pool can immediately satisfy 10, but then must try to establish another 10 connections before the application\u0026rsquo;s connection request reaches connectionTimeout. This in turn creates burst demand on the database, slowing down not just connection establishment itself but also transactions that might actually be returning connections to the pool.\nNow, if your peak is 100 connections and your median is 50, this doesn\u0026rsquo;t matter. But I know many workloads where the peak is 1000 and the median is 25 — in such cases you\u0026rsquo;d want to gradually reduce idle connections.\nUltimately, we adopted a maxPoolSize + minIdle model, where by default they are equal (fixed pool).\nWhile I don\u0026rsquo;t doubt that such workloads (1000 active connections) exist, if someone is actually doing this, I\u0026rsquo;d love to hear their reasoning. 
Unless they have over 128 CPU cores and solid-state storage, they\u0026rsquo;re basically wasting effort.\nThis also means that even if the pool size is fixed, you want to rotate actual sessions in and out so they don\u0026rsquo;t hang onto maximum virtual memory indefinitely.\nWe do this with a maxLifeTime setting to rotate these connections.\nIn real scenarios, fixed pool\u0026rsquo;s protection against connection storm impact is visible. Under fixed pool, when the database\u0026rsquo;s instantaneous active connections spike, the idle connection count drops but the total connection count remains unchanged, and request response time is minimally affected. If maximumPoolSize is set to a value higher than minimumIdle, a connection storm can cause many new sessions to be created instantly, and new session creation is very resource-intensive — this significantly increases request response time.\nConnection Leak Case Study # Since I\u0026rsquo;m not a connection pool expert, I\u0026rsquo;ll just summarize some recently found connection leak information here.\nConnection leaks exhibit the following symptoms:\n\u0026ldquo;Connection is not available\u0026rdquo; exception. Connection leaks, pool saturation, or the database being overwhelmed by excessive active sessions — new requests error out after exceeding connectionTimeout. Growth of active connections. Database monitoring clearly shows an increase in active sessions. Application logs. Application logs also show many connection requests, including active session information. Database views and logs. pg_stat_activity shows all session states and specific SQL, and logs show new connection authentication information. HikariCP leak detection. Requires enabling leakDetectionThreshold. HikariCP can detect connection leaks — this parameter is off by default. For locating connection leaks, you should:\nCheck application logs, especially around the time the problem first occurred. Have a proper monitoring system. 
Be proficient with debug, trace, and other HikariCP settings. Set the leakDetectionThreshold parameter. Possible causes:\nMisuse of streaming responses; Misuse of raw connections; Prolonged operations within @Transactional method (such as network invocation). Configuration errors, reference Virtual threads, reference References # https://github.com/brettwooldridge/HikariCP\nhttps://github.com/brettwooldridge/HikariCP/issues/2148\nhttps://github.com/brettwooldridge/HikariCP/wiki/About-Pool-Sizing\nhttps://blogs.oracle.com/opal/post/always-use-connection-pools\nhttps://mkyong.com/jdbc/hikaripool-1-connection-is-not-available-request-timed-out-after-30002ms/\nhttps://medium.com/@eremeykin/how-to-deal-with-hikaricp-connection-leaks-part-1-1eddc135b464\nhttps://medium.com/@eremeykin/how-to-deal-with-hikaricp-connection-leaks-part-2-847a9629627f\n","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/2024/08/12/getting-started-with-hikaricp-connection-pool/","section":"Posts","summary":"A Brief Introduction to HikariCP # “Hikari” means “light” in Japanese — HikariCP aims to be a Connection Pool as light and fast as light. This nearly Java-only middleware connection pool is extremely lightweight and performance-focused. HikariCP is now the default connection pool for Spring Boot, and with the proliferation of Spring Boot and microservices, HikariCP usage continues to grow.\nOn the HikariCP GitHub homepage, there’s a performance comparison: （https://github.com/brettwooldridge/HikariCP-benchmark）\n","title":"Getting Started with HikariCP Connection Pool","type":"posts"},{"content":" Preface # PostgreSQL Database Technology Summit Chengdu Stop # Recently (June 17, 2023), the \u0026ldquo;PostgreSQL Database Technology Summit Chengdu Stop\u0026rdquo; organized by the PostgreSQL branch of the China Open Source Software Promotion Alliance was successfully held. I had the honor of participating as a speaker and gained a lot from it. 
(Summit review and all PPT downloads: PPT downloads are here | PostgreSQL Technology Summit Chengdu Stop Review)\nMy Sharing # My technical sharing topic was: Database History and SSI. I\u0026rsquo;ve noticed that many domestic technical blogs describe transactions inaccurately, which can confuse beginners. Additionally, many colleagues aren\u0026rsquo;t very familiar with transaction history and SSI in PostgreSQL. This time, I collected and summarized accurate definitions of transactions, transaction history, and SSI theoretical foundations from Wikipedia, official SQL standards, and various papers. The main thread of the sharing goes from transaction history to anomalies not present in the SQL-92 standard, to how these anomalies can be eliminated, gradually progressing to how SSI is implemented in PostgreSQL. The entire sharing is divided into 4 parts: Transaction Fundamentals, Transaction History, SSI Theoretical Knowledge, and SSI in PostgreSQL.\nTransaction Fundamentals # Before understanding transaction history and SSI, let\u0026rsquo;s review and revisit some basic transaction knowledge. The entire chapter will revolve around discussing transactions, and basic transaction knowledge will lead into the problems in transaction history.\nWhat is a Transaction? # Original meaning of transaction: A transaction is an exchange, a deal. Exchange is the original meaning of transaction, and what we call transactions in databases comes from this word. Database transaction: A transaction is the basic unit of work in a relational database. For example: Deleting data from table A and inserting data into table B — we can wrap these two actions into one transaction. Both must complete. But due to unexpected factors, the transaction might fail or be canceled halfway through execution. 
In that case, all operations in the entire transaction must roll back to the state before the transaction — A doesn\u0026rsquo;t delete and B doesn\u0026rsquo;t insert.\nACID # ACID is an important characteristic of database transactions. It determines whether a transaction is reliable and trustworthy. Atomicity: All operations within a transaction either complete entirely or cancel entirely. Like atoms in chemistry — indivisible and unsplittable. If a transaction encounters a problem midway and fails to execute, the entire transaction must roll back. Consistency: When a transaction completes, all data remains in a consistent state. This definition is actually somewhat vague. Transactions generally operate on data, and the state of data in the database gets updated. Due to transaction operations, data transitions from one state to another. This state must be reasonable and legitimate — the data logic must be consistent with real-world logic. This might be abstract, so here\u0026rsquo;s an example: Say A has 100 yuan, B has 200 yuan, their combined total is 300 yuan. Now B transfers 100 yuan to A. Then A has 200 yuan, B has 100 yuan, and their combined total is still 300 yuan. Key point: The data changes in this virtual world should remain consistent with real-world logic. Isolation: The result of executing multiple transactions concurrently must be the same as executing them separately one after another. For example, with 2 transactions, executing them serially one after another must produce the same result as executing them in parallel. (This is the official understanding from Wikipedia and the definition in the SQL standard — please remember this definition, as it\u0026rsquo;s the focus of this article.) Durability: After a transaction completes, changes to data are permanent. If updated data is placed in memory and disappears when the machine powers off, then it should go to disk. But is disk storage safe? What if the disk fails? 
We could have a high-availability architecture writing multiple copies of data. Extending further, we could have geographic-level disaster recovery. But if we push further — what if multiple regions all fail? From an architectural perspective, this question seems to have no answer. But from the user\u0026rsquo;s perspective, it\u0026rsquo;s actually easier to understand. For example, when a user deposits money — they put the cash in, and their account should display that amount. This number is permanent for the user. The user believes that even if the sky falls, their account should have this number. That is the meaning of durability.\nANSI SQL-92 Standard # In 1992, the American National Standards Institute ANSI SQL-92 standard defined 4 isolation levels and 3 anomaly phenomena. Although the database industry today mostly follows ISO international standards, this 1992 American standard had a huge impact on the database industry. I believe many database practitioners are familiar with the 4 isolation levels.\nIsolation Levels in the SQL-92 Standard # ANSI SQL-92 defines 4 isolation levels: Transaction isolation levels from high to low. Notice Serializable: when all transactions in the system execute in parallel, there is no difference from executing them serially — transactions do not affect each other. Doesn\u0026rsquo;t this resemble the definition of Isolation in ACID? All 4 isolation levels can satisfy all-or-nothing execution of transactions. They only differ in their definitions of isolation. All isolation levels can have atomicity, consistency, and durability, but different isolation levels have different isolation characteristics. By definition, only Serializable fully satisfies ACID.\nAnomaly Phenomena in the SQL-92 Standard # The SQL-92 standard defines 3 anomaly phenomena. There are many definitions online, but many are not entirely accurate. 
Here we directly extract the definitions of the 3 anomaly phenomena from the SQL-92 standard document:\nDirty Read: Transaction T1 updates a row. Transaction T2 can read this row before T1 commits. If T1 executes a rollback, T2 will have read a row that was never committed. Dirty reads have an obvious problem — the user may not know whether the money has actually arrived. Before the transaction completes, the user can query and see money transferred into the account, but if the transaction fails and rolls back for some reason, the money disappears again. This is hard for users to understand.\nNon-repeatable Read: Transaction T1 reads a row. Transaction T2 updates or deletes that row and commits. If T1 reads that row again, it will find the row has been changed or deleted. Phantom Read: Transaction T1 reads N rows matching certain conditions. Transaction T2 executes SQL that generates rows satisfying these conditions. When T1 reads again, it finds inconsistent row results. The difference between non-repeatable read and phantom read is: one is caused by other transactions updating or deleting leading to inconsistent reads within the same transaction; the other is caused by other transactions inserting leading to inconsistent reads within the same transaction.\nSQL-92 Standard and PostgreSQL # In the SQL-92 standard, isolation levels and anomaly phenomena have a stepped relationship. Except for Serializable which has no anomalies, each isolation level adds anomaly phenomena step by step. Now let\u0026rsquo;s look at the following table — this is the isolation levels and anomaly phenomena in PostgreSQL, which is different from the SQL-92 standard.\nWhy is PostgreSQL\u0026rsquo;s isolation level inconsistent with the SQL-92 standard? # Why is Read Uncommitted inconsistent with the SQL-92 standard? Read Uncommitted is simply too strange. In relational databases, it\u0026rsquo;s hard to imagine a scenario for using Read Uncommitted. 
It severely violates transaction isolation. PostgreSQL treats \u0026ldquo;Read Uncommitted\u0026rdquo; as \u0026ldquo;Read Committed.\u0026rdquo; Why is Repeatable Read inconsistent with the SQL-92 standard? PostgreSQL implements MVCC (Multi-Version Concurrency Control) through snapshots. The Repeatable Read level in PostgreSQL is actually the Snapshot Isolation level, which doesn\u0026rsquo;t have the Phantom Read anomaly. Although the SQL-92 standard has far-reaching influence, many databases haven\u0026rsquo;t fully implemented it. The ANSI SQL-92 standard has vague definitions. The SQL-92 standard is very representative in the database industry: \u0026ldquo;It\u0026rsquo;s good, but not good enough.\u0026rdquo; Transaction History # History of Transactions # To understand \u0026ldquo;It\u0026rsquo;s good, but not good enough,\u0026rdquo; we need to review transaction history, going back 40 years. Notice the timing of the SQL-92 standard and the \u0026ldquo;Critique of SQL-92.\u0026rdquo; Although the SQL-92 standard was \u0026ldquo;flawed,\u0026rdquo; it still had a profound impact on the database industry. Subsequently, after many serializability theories were proven, PostgreSQL became the first production database to implement SSI.\nCritique of the SQL-92 Standard # Shortly after the SQL-92 standard was released, some Microsoft engineers and academics critiqued it and proposed more isolation levels and anomaly phenomena. Where the SQL-92 standard defined 4 isolation levels and 3 anomaly phenomena, the \u0026ldquo;Critique of SQL-92\u0026rdquo; had 6 isolation levels and 8 anomaly phenomena. The additional isolation levels and anomaly phenomena were not defined in ANSI SQL-92. Snapshot Isolation sits between Repeatable Read and Serializable. This is also one of the reasons why PostgreSQL\u0026rsquo;s Repeatable Read and Serializable look so similar. The Write Skew anomaly was identified. It occurs at the Snapshot Isolation level. 
Isolation Levels of Popular Databases # MySQL at Serializable isolation level: reads acquire shared read locks on data, meaning reads block writes. Oracle can also set the Serializable isolation level and claims to support serializability, but it\u0026rsquo;s not true serializability — it\u0026rsquo;s just Snapshot Isolation. PostgreSQL supports Serializable. It implements serializability on top of Snapshot Isolation, fully named Serializable Snapshot Isolation (SSI), where reads and writes do not block each other. You can see the differences among the three — only PostgreSQL\u0026rsquo;s Serializable has real substance.\nWhy Did Oracle Deceive Us? # What did Oracle deceive us about? It passed off the Snapshot Isolation isolation level as the Serializable isolation level. Why did this happen? If we add Snapshot Isolation to the ANSI SQL-92 standard: The SQL-92 standard defines fewer anomaly phenomena and doesn\u0026rsquo;t define Snapshot Isolation. By the SQL-92 standard\u0026rsquo;s view, Snapshot Isolation looks similar to Serializable. Most relational databases follow the SQL-92 standard, including Oracle. But when better standards later emerged, they didn\u0026rsquo;t make changes. Why Do Weak Isolation Levels Have Academic Problems but Few Serious Real-World Issues? # Anomaly phenomena at non-serializable isolation levels generally require high concurrency to manifest. Low-concurrency databases are unlikely to encounter problems. When anomaly phenomena do occur, some applications may not notice them, or may detect anomalies but find them unimportant. Data might be anomalous, but the application simply returns an error and enters an anomaly handling routine. Costs are too high. Not only is the development cost of database serializable isolation levels high, but applications also need adaptation costs for serializability. Just understanding this complex theory is no easy task. High-level isolation loses some performance. 
Extensive modification work may be thankless — applications need to choose between \u0026ldquo;high concurrency\u0026rdquo; and \u0026ldquo;no anomaly phenomena.\u0026rdquo; Businesses develop based on mechanisms rather than rules. Businesses somewhat adapt to the anomaly phenomena of weak isolation levels, especially Read Committed. What\u0026rsquo;s the Point of Serializable? # If weak isolation seems to work fine in the real world, what\u0026rsquo;s the point of Serializable? There is actually a point:\nAlthough applications adapt to weak isolation levels, it doesn\u0026rsquo;t mean they truly understand them. Using Serializable, applications can greatly reduce concerns about data anomalies. Except for Serializable, all other isolation levels have their own anomaly phenomena and don\u0026rsquo;t fully satisfy ACID\u0026rsquo;s Isolation property. Serializable can eliminate anomaly phenomena — the \u0026ldquo;termites\u0026rdquo; — fully ensuring data safety. Serializable has been proven theoretically achievable. Some serializable implementations do significantly reduce concurrency, but there are other implementations with minimal concurrency impact. For example, Serializable Snapshot Isolation (SSI). SSI Theoretical Knowledge # After all that about transaction fundamentals and history, we finally arrive at the concept of SSI. 
But before understanding SSI, we need to understand two more concepts: Serializable and Snapshot Isolation.\nSerializable # Meaning of Serializable If each transaction itself is correct (satisfying certain integrity conditions), then any serial schedule including these transactions is correct (its transactions still satisfy their conditions): \u0026ldquo;Serial\u0026rdquo; means transactions don\u0026rsquo;t overlap in time and cannot interfere with each other — i.e., there exists complete isolation between them.\nImplementation of Serializable In early transaction development, Serializable was implemented through Strict Two-Phase Locking (S2PL), where reads and writes block each other until the transaction ends. This eliminated anomaly phenomena but S2PL lost high performance. Besides S2PL, there are other ways to achieve serializability, such as Serializable Snapshot Isolation (SSI).\nSignificance of Serializable To ensure no anomalies, Serializable sacrifices some concurrency (varying by implementation approach), but it truly guarantees ACID isolation for data. That is to say, databases that haven\u0026rsquo;t implemented serializability don\u0026rsquo;t fully support ACID properties. Serializable has been proven theoretically achievable, but the real database world is somewhat \u0026ldquo;abnormal.\u0026rdquo; In practice, Serializable is the highest transaction isolation level and is strongly recommended by academics and industry leaders, yet the vast majority of databases run at Read Committed or Snapshot Isolation levels.\nSnapshot Isolation # Definition of Snapshot Isolation Transactions executing under Snapshot Isolation operate on a snapshot of the database taken at the start of the transaction. When the transaction ends, it will only commit successfully if the values it updated haven\u0026rsquo;t been externally changed since the snapshot was taken. 
As the name implies, Snapshot Isolation uses snapshots, which are widely used to implement MVCC, enabling multi-version concurrency mechanisms to support concurrent transaction execution by users.\nEmergence of Snapshot Isolation ANSI SQL-92 did not define Snapshot Isolation (SI). This isolation level emerged as the database industry evolved. The 1992 ANSI SQL-92 standard was defined based on database locks, so there was no definition for the Snapshot Isolation level. It wasn\u0026rsquo;t proposed until the 1995 \u0026ldquo;Critique\u0026rdquo; appeared.\nSSI # Serializable Snapshot Isolation (SSI) Given the widespread use of Snapshot Isolation and the academic goal that databases should achieve the Serializable isolation level, Serializable Snapshot Isolation (SSI), as the name suggests, implements serializability on top of Snapshot Isolation.\nWhy SSI? Due to the vagueness of the ANSI SQL-92 standard, although it didn\u0026rsquo;t define Snapshot Isolation, many databases actually use it. And Snapshot Isolation also has some anomaly phenomena (including Write Skew). SSI emerged to address these anomaly phenomena.\nAdvantages of SSI over S2PL Traditional serializability is implemented through S2PL. Under S2PL, reads and writes of different transactions block each other until the transaction ends. Although it achieves serializability without Write Skew anomalies, it generates many lock conflicts, reducing concurrency performance. In contrast, MVCC implemented through snapshots has non-blocking reads and writes, with only write-write conflicts. SSI built on this foundation has much less impact on concurrency compared to traditional S2PL.\nPostgreSQL Implements SSI PostgreSQL has implemented SSI since version 9.1, making it the first production database to implement SSI.\nThree Types of Dependencies # Write-Read Dependency (wr): Transaction T1 writes a version of a data item, and transaction T2 reads this version, meaning T1 precedes T2. 
Write-Write Dependency (ww): Transaction T1 writes a version of a data item, and transaction T2 replaces this version with a new one, meaning T1 precedes T2. Read-Write Anti-dependency (rw): Transaction T1 writes a version of a data item, and transaction T2 reads the version before this one, meaning T2 precedes T1.\nWrite Skew Theory # When certain conflicts form a cycle, serialization anomalies occur. That is to say, some concurrently executing transactions are theoretically non-serializable. One of the more easily understood examples is Write Skew. Write skew only occurs in the rw model — ww and wr won\u0026rsquo;t cause write skew — and transactions must be under concurrent conditions for it to appear.\nSimple Write Skew: Transaction T1 has an rw anti-dependency on T2, and T2 also has an rw anti-dependency on T1. The concurrent execution of these two transactions is non-serializable.\nReal-World Write Skew Problems # Many real-world cases can produce Write Skew anomalies. Let\u0026rsquo;s use the classic black-and-white ball problem to understand Write Skew: There are 4 balls in a bag: 2 white and 2 black. Now there are two transactions, P and Q. P changes all black balls to white, Q changes all white balls to black. There can be two serial executions: \u0026lt;P, Q\u0026gt; or \u0026lt;Q, P\u0026gt;. In both cases, the final result is 4 white balls or 4 black balls. However, Snapshot Isolation allows another result: Transaction P takes out 2 black balls Transaction Q takes out 2 white balls Transaction P changes all black balls in hand to white and puts them back Transaction Q changes all white balls in hand to black and puts them back Now the bag still has 2 black balls and 2 white balls. This is impossible in any serial execution. 
But this is valid under Snapshot Isolation: each transaction maintains a consistent view of the database, and its write set doesn\u0026rsquo;t overlap with any concurrent transaction\u0026rsquo;s write set, resulting in the white and black balls exchanging.\nWe can also make the problem more concrete and practical. Here\u0026rsquo;s a rough example: Suppose I have several bank cards, half frozen and half unfrozen. At one terminal, I execute freezing all cards. At another terminal, I immediately execute unfreezing all cards. From an intent perspective, my cards should all be unfrozen. But a strange phenomenon occurs: previously frozen cards become unfrozen, and previously unfrozen cards become frozen. As a customer, I would be confused.\nThe black-and-white ball problem illustrates: Snapshot Isolation execution results are inconsistent with Serializable execution results. Under Snapshot Isolation, a Write Skew anomaly occurs, and data results don\u0026rsquo;t match expectations.\nSSI in PostgreSQL # How PostgreSQL Handles SSI # It\u0026rsquo;s actually simple — cancel the pivot transaction that forms the \u0026ldquo;dangerous structure.\u0026rdquo; We first set the isolation level to Serializable for both. The table has some white balls and some black balls.\nT1 T2 set default_transaction_isolation = \u0026lsquo;serializable\u0026rsquo;; set default_transaction_isolation = \u0026lsquo;serializable\u0026rsquo;; begin; update dots set color = \u0026lsquo;black\u0026rsquo; where color = \u0026lsquo;white\u0026rsquo;; begin; update dots set color = \u0026lsquo;white\u0026rsquo; where color = \u0026lsquo;black\u0026rsquo;; commit; commit; ERROR: could not serialize access due to read/write dependencies among transactions DETAIL: Reason code: Canceled on identification as a pivot, during commit attempt. HINT: The transaction might succeed if retried. Transaction 1 changes all white to black, Transaction 2 changes all black to white, then both commit. 
The first transaction to commit succeeds, the second fails. The error says: could not serialize access due to read/write dependencies among transactions, canceled on identification as a pivot. If you retry the transaction, it might succeed. Of course it would succeed here: the other transaction has already completed, so one transaction alone cannot form a dependency cycle. At other isolation levels like Repeatable Read or Read Committed, these two transactions would execute without any error, running normally, but the data results would differ from SSI\u0026rsquo;s results.\nPostgreSQL SSI Implementation Optimizations # PostgreSQL implements Serializable SSI on top of Snapshot Isolation and has made many optimizations to improve concurrency at high isolation levels. PostgreSQL\u0026rsquo;s SSI optimizations mainly include 3 points: Safe Snapshots: Read-only transactions that won\u0026rsquo;t create cyclic structures don\u0026rsquo;t need conflict detection, reducing checking overhead and memory burden.\nDeferrable Transactions: A transaction declared SERIALIZABLE READ ONLY DEFERRABLE may block at startup until it can obtain a safe snapshot. After that it runs without the usual Serializable overhead and cannot be canceled by a serialization failure. Deferrable transactions need to be explicitly declared.\nDetection Granularity Escalation: Multiple fine-grained predicate locks (tuple level) can be promoted to coarser page-level or relation-level locks to reduce memory overhead.\nOptimization Results (Performance Benchmark Comparison): The green line is the Snapshot Isolation baseline. The blue line shows PostgreSQL\u0026rsquo;s SSI performance, which is already very close to Snapshot Isolation. The brown line is SSI without the read-only optimizations; comparing it with the blue line shows how much the read-only transaction optimization improves performance. In typical business systems, read-only transactions outnumber change transactions. 
The red line is serializability implemented through Strict Two-Phase Locking, whose performance is far worse.\nThe table below shows concurrency pressure and transaction failure rates. Since some transactions must be canceled to break cycles, Serializable inevitably cancels more transactions than weaker isolation levels. The table also shows that PostgreSQL\u0026rsquo;s SSI achieves far higher concurrency and transaction success rates than Strict Two-Phase Locking.\nOptimization Results (Request Volume and Failure Rate): Summary # Serializable simplifies application development. Developers don\u0026rsquo;t need to reason about transaction anomalies under concurrency, which matters more as systems become increasingly concurrent; they only need to retry transactions that fail with a serialization error. PostgreSQL\u0026rsquo;s Serializable is clearly better than the Strict Two-Phase Locking model: it offers not only better performance but also a lower transaction abort probability. PostgreSQL was the first production database to implement SSI, while many traditional relational databases don\u0026rsquo;t offer true serializability at all. PostgreSQL has taken a big step forward. PostgreSQL not only implemented SSI but also made many optimizations on top of it, such as read-only transaction and memory optimizations, with significant results. ","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/2024/08/12/history-of-transactions-and-ssi-postgresql-database-technology-summit-chengdu-stop-sharing/","section":"Posts","summary":"Preface # PostgreSQL Database Technology Summit Chengdu Stop # Recently (June 17, 2023), the “PostgreSQL Database Technology Summit Chengdu Stop” organized by the PostgreSQL branch of the China Open Source Software Promotion Alliance was successfully held. I had the honor of participating as a speaker and gained a lot from it. 
(Summit review and all PPT downloads: PPT downloads are here | PostgreSQL Technology Summit Chengdu Stop Review)\n","title":"History of Transactions and SSI — PostgreSQL Database Technology Summit Chengdu Stop Sharing","type":"posts"},{"content":"How does the database access system tables before pg_class exists? This question can be divided into two stages:\nDatabase cluster initialization — at this point no database exists at all, so how to construct and access system tables like pg_class is a problem. Private memory initialization of system tables. PG stores system table information in the local backend process. How does the backend load pg_class during initialization? Initializing the Data Dictionary # When the database hasn\u0026rsquo;t been initialized yet, it\u0026rsquo;s obviously impossible to access the data dictionary to initialize objects like database, pg_class, etc., because without a database you can\u0026rsquo;t CREATE DATABASE, and without pg_class you can\u0026rsquo;t look up metadata information.\nPG uses a special language in BKI files to initialize some data structures, then initializes a primitive database in bootstrap mode1.\nCompilation Phase: genbki.h \u0026amp; genbki.pl # src/include/catalog/genbki.h:\n* genbki.h defines CATALOG(), BKI_BOOTSTRAP and related macros * so that the catalog header files can be read by the C compiler. * (These same words are recognized by genbki.pl to build the BKI * bootstrap file from these header files.) genbki.h is quite minimal — mainly macro definitions for catalog-related operations, as well as macros for the BKI bootstrap file. 
Data dictionary header files all include genbki.h.\ngenbki.pl reads the .h table definition files from /src/include/catalog during compilation (excluding pg_*_d.h), and creates the postgres.bki file and pg_*_d.h header files.\nTaking pg_class as an example:\n[postgres@catalog]$ ll |grep pg_class -rw-r----- 1 postgres postgres 3682 Aug 6 2019 pg_class.dat lrwxrwxrwx 1 postgres postgres 86 Apr 8 20:31 pg_class_d.h -\u0026gt; /lzl/soft/postgresql-11.5/src/backend/catalog/pg_class_d.h -rw-r----- 1 postgres postgres 5219 Aug 6 2019 pg_class.h The pg_*_d.h header files are generated by genbki.pl. All pg_*_d.h files contain the following line:\nIt has been GENERATED by src/backend/catalog/genbki.pl\nEach data dictionary has a struct typedef struct FormData_*catalogname* for storing the row data of the data dictionary2, for example pg_class\u0026rsquo;s FormData_pg_class:\nCATALOG(pg_class,1259,RelationRelationId) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83,RelationRelation_Rowtype_Id) BKI_SCHEMA_MACRO { /* oid */ Oid\toid; /* class name */ NameData\trelname; /* OID of namespace containing this class */ Oid\trelnamespace BKI_DEFAULT(pg_catalog) BKI_LOOKUP(pg_namespace); /* OID of entry in pg_type for relation\u0026#39;s implicit row type, if any */ Oid\treltype BKI_LOOKUP_OPT(pg_type); /* OID of entry in pg_type for underlying composite type, if any */ Oid\treloftype BKI_DEFAULT(0) BKI_LOOKUP_OPT(pg_type); /* class owner */ Oid\trelowner BKI_DEFAULT(POSTGRES) BKI_LOOKUP(pg_authid); ... /* access-method-specific options */ text\treloptions[1] BKI_DEFAULT(_null_); /* partition bound node tree */ pg_node_tree relpartbound BKI_DEFAULT(_null_); #endif } FormData_pg_class; pg_class\u0026rsquo;s OID is hardcoded as 1259, and all fields are in the FormData_pg_class struct.\nAfter initializing the struct for user data storage, the corresponding .dat file is used to insert base data. 
pg_class inserts 4 rows of data, which can be understood as bootstrap items (49 data dictionary tables in PG15):\n{ oid =\u0026gt; \u0026#39;1247\u0026#39;, relname =\u0026gt; \u0026#39;pg_type\u0026#39;, reltype =\u0026gt; \u0026#39;pg_type\u0026#39; }, { oid =\u0026gt; \u0026#39;1249\u0026#39;, relname =\u0026gt; \u0026#39;pg_attribute\u0026#39;, reltype =\u0026gt; \u0026#39;pg_attribute\u0026#39; }, { oid =\u0026gt; \u0026#39;1255\u0026#39;, relname =\u0026gt; \u0026#39;pg_proc\u0026#39;, reltype =\u0026gt; \u0026#39;pg_proc\u0026#39; }, { oid =\u0026gt; \u0026#39;1259\u0026#39;, relname =\u0026gt; \u0026#39;pg_class\u0026#39;, reltype =\u0026gt; \u0026#39;pg_class\u0026#39; }, postgres=# select oid,relname from pg_class where oid::int \u0026gt;=1247 and oid::int\u0026lt;=1259; oid | relname ------+-------------- 1247 | pg_type 1249 | pg_attribute 1255 | pg_proc 1259 | pg_class Once the base data dictionary is written, everything else can be generated from it.\nDatabase Initialization Phase: initdb \u0026amp; postgres.bki # Comment from initdb.c:\n* To create template1, we run the postgres (backend) program in bootstrap * mode and feed it data from the postgres.bki library file. After this * initial bootstrap phase, some additional stuff is created by normal * SQL commands fed to a standalone backend. The backend is launched in bootstrap mode and runs the postgres.bki script. postgres.bki can execute relevant functions without any system tables. Only after this can normal SQL files and standard backend processes be used.\ntemplate1 can be called the bootstrap database. The postgres and template0 databases are created only after template1 is established:\nvoid initialize_data_directory(void) { ... /* Bootstrap template1 */ bootstrap_template1(); ... 
make_template0(cmdfd); make_postgres(cmdfd); PG_CMD_CLOSE; check_ok(); } Once template1 exists, make_template0 and make_postgres create the corresponding template0 and postgres databases, using the normal SQL CREATE DATABASE command:\n/* * copy template1 to postgres */ static void make_postgres(FILE *cmdfd) { const char *const *line; /* * Just as we did for template0, and for the same reasons, assign a fixed * OID to postgres and select the file_copy strategy. */ static const char *const postgres_setup[] = { \u0026#34;CREATE DATABASE postgres OID = \u0026#34; CppAsString2(PostgresDbOid) \u0026#34; STRATEGY = file_copy;\\n\\n\u0026#34;, \u0026#34;COMMENT ON DATABASE postgres IS \u0026#39;default administrative connection database\u0026#39;;\\n\\n\u0026#34;, NULL }; for (line = postgres_setup; *line; line++) PG_CMD_PUTS(*line); } Backend Local Cache of Data Dictionary # For PG private memory basics, refer to PostgreSQL Memory Analysis3.\nPG\u0026rsquo;s data dictionary information is stored in the local backend process, not shared. The data dictionary cache mainly focuses on syscache/catcache and relcache, which cache system table and table schema information respectively.\nsyscache/catcache is used to cache system tables, with syscache acting as the upper layer of catcache. syscache is an array where each element corresponds to a catcache, and each catcache corresponds to a system table1.\n//PG15.3 SysCacheSize=35 static CatCache *SysCache[SysCacheSize]; When PG forks a backend, it calls InitPostgres, which calls the initialization functions for syscache/catcache and relcache. 
Let\u0026rsquo;s look at backend initialization.\nsyscache/catcache Initialization # struct cachedesc { Oid\treloid;\t/* OID of the relation being cached */ Oid\tindoid;\t/* OID of index relation for this cache */ int\tnkeys;\t/* # of keys needed for cache lookup */ int\tkey[4];\t/* attribute numbers of key attrs */ int\tnbuckets;\t/* number of hash buckets for this cache */ }; static const struct cachedesc cacheinfo[] = { { ... {RelationRelationId,\t/* RELNAMENSP */ ClassNameNspIndexId, 2, { Anum_pg_class_relname, Anum_pg_class_relnamespace, 0, 0 }, 128 }, {RelationRelationId,\t/* RELOID */ ClassOidIndexId, 1, { Anum_pg_class_oid, 0, 0, 0 }, 128 ... }; For example, Anum_pg_class_oid is defined in pg_class_d.h generated by genbki.pl:\n#define Anum_pg_class_oid 1 reloid is the OID:\nselect oid,relname from pg_class where oid::int \u0026gt;=1247 and oid::int\u0026lt;=1259; oid | relname ------+-------------- 1259 | pg_class InitCatalogCache actually initializes the syscache array, i.e., initializes all catcaches. InitCatalogCache eventually fully initializes CatCache through InitCatCache (one of which is for pg_class):\nvoid InitCatalogCache(void) { ... for (cacheId = 0; cacheId \u0026lt; SysCacheSize; cacheId++) { SysCache[cacheId] = InitCatCache(cacheId, cacheinfo[cacheId].reloid, cacheinfo[cacheId].indoid, cacheinfo[cacheId].nkeys, cacheinfo[cacheId].key, cacheinfo[cacheId].nbuckets); if (!PointerIsValid(SysCache[cacheId])) elog(ERROR, \u0026#34;could not initialize cache %u (%d)\u0026#34;, cacheinfo[cacheId].reloid, cacheId); /* Accumulate data for OID lists, too */ SysCacheRelationOid[SysCacheRelationOidSize++] = cacheinfo[cacheId].reloid; SysCacheSupportingRelOid[SysCacheSupportingRelOidSize++] = cacheinfo[cacheId].reloid; SysCacheSupportingRelOid[SysCacheSupportingRelOidSize++] = cacheinfo[cacheId].indoid; /* see comments for RelationInvalidatesSnapshotsOnly */ Assert(!RelationInvalidatesSnapshotsOnly(cacheinfo[cacheId].reloid)); } ... 
CacheInitialized = true; } Then we come to catcache.c.\nInitCatCache allocates memory and manages it in CacheMemoryContext. It only assigns some macro-defined OIDs to the corresponding catcache — at this point, tables are not yet opened:\n/* *\tInitCatCache * *\tThis allocates and initializes a cache for a system catalog relation. *\tActually, the cache is only partially initialized to avoid opening the *\trelation. The relation will be opened and the rest of the cache *\tstructure initialized on the first access. */ CatCache * InitCatCache(int id, Oid reloid, Oid indexoid, int nkeys, const int *key, int nbuckets) { ... oldcxt = MemoryContextSwitchTo(CacheMemoryContext); ... sz = sizeof(CatCache) + PG_CACHE_LINE_SIZE; cp = (CatCache *) CACHELINEALIGN(palloc0(sz)); cp-\u0026gt;cc_bucket = palloc0(nbuckets * sizeof(dlist_head)); /* * initialize the cache\u0026#39;s relation information for the relation * corresponding to this cache, and initialize some of the new cache\u0026#39;s * other internal fields. But don\u0026#39;t open the relation yet. */ cp-\u0026gt;id = id; cp-\u0026gt;cc_relname = \u0026#34;(not known yet)\u0026#34;; cp-\u0026gt;cc_reloid = reloid; cp-\u0026gt;cc_indexoid = indexoid; cp-\u0026gt;cc_relisshared = false; /* temporary */ cp-\u0026gt;cc_tupdesc = (TupleDesc) NULL; cp-\u0026gt;cc_ntup = 0; cp-\u0026gt;cc_nbuckets = nbuckets; cp-\u0026gt;cc_nkeys = nkeys; for (i = 0; i \u0026lt; nkeys; ++i) cp-\u0026gt;cc_keyno[i] = key[i]; ... MemoryContextSwitchTo(oldcxt); return cp; } id is the index of the catcache array element. The assigned reloid is the known OID from cacheinfo, and key[4] from cacheinfo is also assigned. Other information is mostly unknown yet — for example, relname, tupdesc — because system tables haven\u0026rsquo;t been opened yet.\ncatcache only opens tables during search operations. 
Although the function name contains *init*, it\u0026rsquo;s no longer in the initialization process — the relevant functions won\u0026rsquo;t be shown here.\nAfter syscache/catcache initialization completes, there is actually no tuple information at all.\nrelcache Initialization # The relcache initialization is well explained in PostgreSQL Memory Analysis.\nrelcache initialization has 5 phases:\nRelationCacheInitialize - initializes relcache, initially empty RelationCacheInitializePhase2 - initializes shared catalogs and loads 5 global system tables RelationCacheInitializePhase3 - completes relcache initialization and loads 4 basic system tables RelationIdGetRelation - gets rel description by relation id RelationClose - closes a relation Both RelationCacheInitializePhase2 and RelationCacheInitializePhase3 load system tables, and they must be in order.\nRelationCacheInitializePhase2 loads several system tables — interested readers can check the function themselves. RelationCacheInitializePhase3 is the one relevant to our question, let\u0026rsquo;s look at that:\n/* *\tRelationCacheInitializePhase3 * *\tThis is called as soon as the catcache and transaction system *\tare functional and we have determined MyDatabaseId. At this point *\twe can actually read data from the database\u0026#39;s system catalogs. *\tWe first try to read pre-computed relcache entries from the local *\trelcache init file. If that\u0026#39;s missing or broken, make phony entries *\tfor the minimum set of nailed-in-cache relations. Then (unless *\tbootstrapping) make sure we have entries for the critical system *\tindexes. Once we\u0026#39;ve done all this, we have enough infrastructure to *\topen any system catalog or use any catcache. The last step is to *\trewrite the cache files if needed. */ void RelationCacheInitializePhase3(void) { ... 
if (IsBootstrapProcessingMode() || !load_relcache_init_file(false)) { needNewCacheFile = true; formrdesc(\u0026#34;pg_class\u0026#34;, RelationRelation_Rowtype_Id, false, Natts_pg_class, Desc_pg_class); formrdesc(\u0026#34;pg_attribute\u0026#34;, AttributeRelation_Rowtype_Id, false, Natts_pg_attribute, Desc_pg_attribute); formrdesc(\u0026#34;pg_proc\u0026#34;, ProcedureRelation_Rowtype_Id, false, Natts_pg_proc, Desc_pg_proc); formrdesc(\u0026#34;pg_type\u0026#34;, TypeRelation_Rowtype_Id, false, Natts_pg_type, Desc_pg_type); #define NUM_CRITICAL_LOCAL_RELS 4\t/* fix if you change list above */ } MemoryContextSwitchTo(oldcxt); /* In bootstrap mode, the faked-up formrdesc info is all we\u0026#39;ll have */ if (IsBootstrapProcessingMode()) return; ... /* now write the files */ write_relcache_init_file(true); write_relcache_init_file(false); } } IsBootstrapProcessingMode is specifically designed for bootstrap mode — normal backends don\u0026rsquo;t satisfy this condition.\nload_relcache_init_file(false) attempts to load system table information from the init file. load_relcache_init_file(false) passes false meaning it\u0026rsquo;s a private init file, not a shared one:\n[postgres@16384]$ pwd /pgdata/lzl/data15_6879/base/16384 -- Rough view. 
strings ignores some info, but table and column names are visible [postgres@16384]$ strings pg_internal.init |grep pg_class pg_class_oid_index pg_class pg_class_relname_nsp_index [postgres@16384]$ strings pg_internal.init |grep -E \u0026#34;pg_class|relname\u0026#34; pg_class_oid_index pg_class relname relnamespace pg_class_relname_nsp_index relname relnamespace If the init file is damaged or doesn\u0026rsquo;t exist, loading it fails and execution enters the branch that loads the 4 basic system tables:\n// Similar to phase 2, load more system table descriptions if (IsBootstrapProcessingMode() || !load_relcache_init_file(false)) { needNewCacheFile = true; formrdesc(\u0026#34;pg_class\u0026#34;, RelationRelation_Rowtype_Id, false, Natts_pg_class, Desc_pg_class); formrdesc(\u0026#34;pg_attribute\u0026#34;, AttributeRelation_Rowtype_Id, false, Natts_pg_attribute, Desc_pg_attribute); formrdesc(\u0026#34;pg_proc\u0026#34;, ProcedureRelation_Rowtype_Id, false, Natts_pg_proc, Desc_pg_proc); formrdesc(\u0026#34;pg_type\u0026#34;, TypeRelation_Rowtype_Id, false, Natts_pg_type, Desc_pg_type); With the 4 basic tables including pg_class, loading subsequent system table information becomes straightforward.\nReferences # 《PostgreSQL Kernel Analysis》 Chapters 2, 3\nhttps://www.postgresql.org/docs/current/system-catalog-declarations.html\nPostgreSQL Memory Analysis\n","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/2024/08/12/how-does-pg-access-basic-system-tables-before-pg_class-exists/","section":"Posts","summary":"How does the database access system tables before pg_class exists? This question can be divided into two stages:\nDatabase cluster initialization — at this point no database exists at all, so how to construct and access system tables like pg_class is a problem. Private memory initialization of system tables. 
PG stores system table information in the local backend process. How does the backend load pg_class during initialization? Initializing the Data Dictionary # When the database hasn’t been initialized yet, it’s obviously impossible to access the data dictionary to initialize objects like database, pg_class, etc., because without a database you can’t CREATE DATABASE, and without pg_class you can’t look up metadata information.\n","title":"How Does PG Access Basic System Tables Before pg_class Exists?","type":"posts"},{"content":" Index Splitting # When an index block is nearly full, index splitting occurs. Index splitting comes in two forms: 55 (a 50/50 split) and 91 (a 90/10 split). The difference between index splitting and the enq: TX - index contention wait event: Whether 55 or 91, a split is normal index behavior as data volume increases. Index splitting is a normal consequence of growing data volume leading to larger indexes: when an index can\u0026rsquo;t hold more data, it naturally needs more index blocks. There are hardly any scenarios with tables but no indexes (only during initial data loading would one consider inserting data first and building indexes afterward). Although index splitting consumes some resources, in today\u0026rsquo;s Oracle environments it can complete quickly. Only when there are too many indexes does it affect insert efficiency.\nHowever, the enq: TX - index contention wait is NOT normal. enq: TX - index contention indicates that SQL statements are waiting on an index block that is currently being split. Essentially, DML concurrency is too high and all sessions are waiting on the splitting index block.\nWhy does enq: TX - index contention always occur on sequentially inserted columns? Although both 55 and 91 splits are possible in real scenarios, enq: TX - index contention frequently occurs with 91 splits. This is because columns like sequences and timestamps usually have indexes, and sequential inserts are common. 
The rightmost block is always the hot block, and subsequent inserts must wait for the split block to complete before they can proceed — this causes enq: TX - index contention. Why don\u0026rsquo;t UUID indexes cause enq: TX - index contention? Because UUID indexes are unordered — inserting causes UUID index splits, but it\u0026rsquo;s unlikely that subsequent UUID values also land on that same splitting index block. So UUID has index splitting but doesn\u0026rsquo;t form an enq wait queue leading to enq: TX - index contention.\nSolutions # Note: what we need to solve is the index split wait enq: TX - index contention, not index splitting itself. Solutions:\n1. Reverse Index A reverse index stores key values in the opposite order. For example, for the value \u0026lsquo;1111 0001\u0026rsquo;, a normal index places it after \u0026lsquo;0000 0002\u0026rsquo;; with a reverse index, it\u0026rsquo;s placed before \u0026lsquo;0000 0002\u0026rsquo;. Think about a timestamp column — normally it\u0026rsquo;s a rightmost hot spot. After reversing, seconds, minutes, and hours sort first. One index block might contain data from different months but the same second. This way, the rightmost hot block essentially disappears — reverse indexes scatter hot spots across various index blocks. Limitations: Requires index modification; may lose index range scan capability. Sequentially growing columns cannot use index range scans (e.g., timestamp columns). In some scenarios, reverse key values might still work — requires specific analysis. Syntax:\nCREATE INDEX reveridx ON tablzl (name) REVERSE; 2. Hash-Partitioned Index Creating a hash-partitioned index on a regular table is equivalent to keeping the table unchanged but partitioning the index, thus scattering the rightmost hot block across partitions. For example, an 8-partition hash-partitioned index divides the index into 8 segments, creating 8 rightmost hot spots and alleviating the index split problem. 
Limitations: Requires index modification; affects index range query performance — requires balancing insert hot spot mitigation vs. query efficiency. Equality and IN queries can efficiently use hash-partitioned indexes. From the official documentation:\nQueries involving equality and IN predicates on index partitioning key can efficiently use global hash partitioned index to answer queries quickly\nHowever, range scan efficiency decreases — the more partitions, the greater the decrease (though more partitions also provide better hot spot relief). This is clearly a balancing act. Tests show that with 8 partitions, logical reads for range scans increase nearly 8x. After partitioning, indexes within each partition remain ordered, and clustering factor differences are minor — the cost of scanning the index is similar, but the cost of table access increases. If a regular index has 8 entries in one block pointing to 1 data block (1 logical read), after hash partitioning across 8 partitions (1 index block each), it becomes 8 logical reads. This is why range scan index performance degrades. Syntax:\nCREATE INDEX cust_last_name_ix ON customers (cust_last_name) GLOBAL PARTITION BY HASH (cust_last_name) PARTITIONS 4; 3. Using Table Partitioning to Scatter Indexes Partition the table and create local indexes to scatter the rightmost hot spots. Limitations: The partition key cannot be the index column (otherwise it defeats the purpose); requires table modification; if existing SQL already has partition key predicates, range scan efficiency is not affected.\n4. Reduce Concurrency Reducing concurrency is the ultimate weapon. Index split contention is fundamentally caused by excessively high concurrency — generally, without dozens of concurrent inserts, index split contention won\u0026rsquo;t occur.\n5. Modify Index Block Size Place index blocks in 16K or 32K tablespaces. In theory, this should help because indexes can hold more data and splitting occurs less frequently. 
However, performance testing is needed, and other parameters may need adjustment.\n6. Remove the Index Removing the index is also an option. Based on business requirements, if the index is not important, drop it. Or use range queries with partitioned tables, leveraging partition pruning instead of indexes.\nWhy These Approaches Don\u0026rsquo;t Work??? # Increasing ITL transaction slots: Index block transaction slots may also be insufficient under high concurrency — this is indeed similar to index splitting, but the wait event is enq: TX - allocate ITL entry. If this wait is observed and traced to index blocks, it indicates high concurrency on the index. Reverse indexes and hash-partitioned indexes can also help, and adjusting initrans may solve the problem. However, the root causes of these two wait events differ — index splitting doesn\u0026rsquo;t always come with transaction slot issues. Adjusting index block PCTFREE: PCTFREE indicates that when a block\u0026rsquo;s free space falls below PCTFREE, it is no longer recorded in FREELIST and cannot accept new inserts. Consider two cases: increasing and decreasing PCTFREE. Increasing PCTFREE only worsens index splitting. Decreasing PCTFREE seems effective — similar to adjusting block size in principle — but in real scenarios PCTFREE defaults to 10%, which is already hard to reduce further, so the effect is negligible. Rebuilding indexes to reduce fragmentation: This is essentially unrelated — it doesn\u0026rsquo;t solve the rightmost hot block problem. References # https://blog.csdn.net/lihuarongaini/article/details/101299328 https://docs.oracle.com/cd/E11882_01/server.112/e41573/data_acc.htm#PFGRF94786\nAcknowledgments: 豪桑, 用哥\n","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/2024/08/12/how-to-solve-index-split-contention/","section":"Posts","summary":"Index Splitting # When an index block is nearly full, index splitting occurs. 
Index splitting comes in two forms: 55 (a 50/50 split) and 91 (a 90/10 split). The difference between index splitting and the enq: TX - index contention wait event: Whether 55 or 91, a split is normal index behavior as data volume increases. Index splitting is a normal consequence of growing data volume leading to larger indexes: when an index can’t hold more data, it naturally needs more index blocks. There are hardly any scenarios with tables but no indexes (only during initial data loading would one consider inserting data first and building indexes afterward). Although index splitting consumes some resources, in today’s Oracle environments it can complete quickly. Only when there are too many indexes does it affect insert efficiency.\n","title":"How to Solve Index Split Contention?","type":"posts"},{"content":" Problem Overview # Last night, the business team updated a SQL statement. Previously, the statement ran very fast without the DATE_CREATED field (the partition key). In the release, the partition field was added to reduce the number of partitions accessed. However, after adding it, the UPDATE actually became slower.\nBefore:\nupdate TABLE_RECORD set IS_DELETED = \u0026#39;1\u0026#39;, DATE_UPDATED = LOCALTIMESTAMP(0) WHERE APPL_NO = $1 AND IS_DELETED = \u0026#39;0\u0026#39; After:\nupdate TABLE_RECORD set IS_DELETED = \u0026#39;1\u0026#39;, DATE_UPDATED = LOCALTIMESTAMP(0) WHERE APPL_NO = $1 AND IS_DELETED = \u0026#39;0\u0026#39; AND DATE_CREATED \u0026gt; now() - interval \u0026#39;31\u0026#39; day AND DATE_CREATED \u0026lt; now() Before the release, the statement completed in milliseconds. After the release, it took about 10 seconds. 
The SQL runs frequently, and the business found this unacceptable.\nProblem Analysis # The Execution Plan Appeared Correct # Table structure:\n## \\d+ TABLE_RECORD Partitioned table \u0026#34;public.TABLE_RECORD\u0026#34; Column | Type | Collation | Nullable | Default | Storage | Stats target | Description --------------------------------+-----------------------------+-----------+----------+---------------------------------------------------+----------+--------------+-------------------------- id_TABLE_RECORD | character varying(32) | | not null | nextval(\u0026#39;seq_TABLE_RECORD\u0026#39;::regclass) | extended | | appl_no | character varying(100) | | | | extended | | r_appl_no | character varying(100) | | | | extended | | ... created_by | character varying(100) | | not null | \u0026#39;sys\u0026#39;::character varying | extended | | date_created | timestamp without time zone | | not null | now() | plain | | updated_by | character varying(100) | | not null | \u0026#39;sys\u0026#39;::character varying | extended | | date_updated | timestamp without time zone | | not null | now() | plain | | Partition key: RANGE (date_created) Indexes: \u0026#34;date_TABLE_RECORD\u0026#34; btree (date_created) \u0026#34;idx_dateupdated\u0026#34; btree (date_updated) \u0026#34;idx_applnodeleted\u0026#34; btree (appl_no, is_deleted) \u0026#34;nk_TABLE_RECORD\u0026#34; btree (appl_no) Partitions: TABLE_RECORD_202211 FOR VALUES FROM (\u0026#39;2022-11-01 00:00:00\u0026#39;) TO (\u0026#39;2022-12-01 00:00:00\u0026#39;), ... 
TABLE_RECORD_202303 FOR VALUES FROM (\u0026#39;2023-03-01 00:00:00\u0026#39;) TO (\u0026#39;2023-04-01 00:00:00\u0026#39;), TABLE_RECORD_202304 FOR VALUES FROM (\u0026#39;2023-04-01 00:00:00\u0026#39;) TO (\u0026#39;2023-05-01 00:00:00\u0026#39;), TABLE_RECORD_202305 FOR VALUES FROM (\u0026#39;2023-05-01 00:00:00\u0026#39;) TO (\u0026#39;2023-06-01 00:00:00\u0026#39;), TABLE_RECORD_202306 FOR VALUES FROM (\u0026#39;2023-06-01 00:00:00\u0026#39;) TO (\u0026#39;2023-07-01 00:00:00\u0026#39;), ... TABLE_RECORD_202512 FOR VALUES FROM (\u0026#39;2025-12-01 00:00:00\u0026#39;) TO (\u0026#39;2026-01-01 00:00:00\u0026#39;), TABLE_RECORD_other DEFAULT This SQL would access partitions from the last 2 months, both of which contained data. The above UPDATE would only update one row.\nAt first, analyzing the problem was very confusing because when we ran EXPLAIN, the execution plan looked fine.\nEXPLAIN partition scan info:\n-\u0026gt; Index Scan using TABLE_RECORD_202302_date_created_idx on TABLE_RECORD_202302 TABLE_RECORD_4 (cost=0.44..5.47 rows=1 width=485) Index Cond: ((date_created \u0026gt; (now() - \u0026#39;31 days\u0026#39;::interval day)) AND (date_created \u0026lt; now())) Filter: (((appl_no)::text = \u0026#39;LZLMATH20230132302302\u0026#39;::text) AND ((is_deleted)::text = \u0026#39;0\u0026#39;::text)) -\u0026gt; Index Scan using TABLE_RECORD_202303_date_created_idx on TABLE_RECORD_202303 TABLE_RECORD_5 (cost=0.44..5.47 rows=1 width=482) Index Cond: ((date_created \u0026gt; (now() - \u0026#39;31 days\u0026#39;::interval day)) AND (date_created \u0026lt; now())) Filter: (((appl_no)::text = \u0026#39;LZLMATH20230132302302\u0026#39;::text) AND ((is_deleted)::text = \u0026#39;0\u0026#39;::text)) -\u0026gt; Index Scan using TABLE_RECORD_202304_date_created_idx on TABLE_RECORD_202304 TABLE_RECORD_6 (cost=0.44..5.47 rows=1 width=481) Index Cond: ((date_created \u0026gt; (now() - \u0026#39;31 days\u0026#39;::interval day)) AND (date_created \u0026lt; now())) Filter: 
(((appl_no)::text = \u0026#39;LZLMATH20230132302302\u0026#39;::text) AND ((is_deleted)::text = \u0026#39;0\u0026#39;::text)) -\u0026gt; Index Scan using idx_applnodeleted_25 on TABLE_RECORD_202305 TABLE_RECORD_7 (cost=0.43..30.49 rows=1 width=483) Index Cond: (((appl_no)::text = \u0026#39;LZLMATH20230132302302\u0026#39;::text) AND ((is_deleted)::text = \u0026#39;0\u0026#39;::text)) Filter: ((date_created \u0026lt; now()) AND (date_created \u0026gt; (now() - \u0026#39;31 days\u0026#39;::interval day))) -\u0026gt; Index Scan using idx_applnodeleted_14 on TABLE_RECORD_202306 TABLE_RECORD_8 (cost=0.56..45.11 rows=18 width=485) Index Cond: (((appl_no)::text = \u0026#39;LZLMATH20230132302302\u0026#39;::text) AND ((is_deleted)::text = \u0026#39;0\u0026#39;::text)) Filter: ((date_created \u0026lt; now()) AND (date_created \u0026gt; (now() - \u0026#39;31 days\u0026#39;::interval day))) -\u0026gt; Index Scan using idx_applnodeleted_38 on TABLE_RECORD_202307 TABLE_RECORD_9 (cost=0.14..5.17 rows=1 width=3502) Index Cond: (((appl_no)::text = \u0026#39;LZLMATH20230132302302\u0026#39;::text) AND ((is_deleted)::text = \u0026#39;0\u0026#39;::text)) Filter: ((date_created \u0026lt; now()) AND (date_created \u0026gt; (now() - \u0026#39;31 days\u0026#39;::interval day))) -\u0026gt; Index Scan using idx_applnodeleted_1 on TABLE_RECORD_202308 TABLE_RECORD_10 (cost=0.14..5.17 rows=1 width=3502) Partition data distribution:\nselect count(*),tableoid::regclass from TABLE_RECORD group by 2; count | tableoid -------+--------------------------------- 56558 | TABLE_RECORD_202303 4436 | TABLE_RECORD_202211 6929 | TABLE_RECORD_202306 945 | TABLE_RECORD_202305 1413 | TABLE_RECORD_202304 5499 | TABLE_RECORD_202212 1486 | TABLE_RECORD_202301 4722 | TABLE_RECORD_202302 The execution plan appeared to access different indexes for different partitions:\ndate_TABLE_RECORD: index on the partition key idx_applnodeleted: composite index on appl_no, is_deleted In reality, the SQL could prune partitions 
using the DATE_CREATED (last 31 days) field. But if it used the index on that field, there would be no selectivity at all. The composite index idx_applnodeleted on appl_no, is_deleted had much better selectivity within partitions, so the correct execution plan should choose the idx_applnodeleted composite index.\nThe EXPLAIN plan above is not the actual execution plan, but we can see that the May and June partitions did use the correct index — the appl_no, is_deleted composite index.\nTo view the actual execution plan, we need to execute the SQL. So we changed the UPDATE to a SELECT:\n# explain (analyze,buffers,timing,verbose) select count(*) from TABLE_RECORD WHERE APPL_NO = \u0026#39;LZLMATH20230132302302\u0026#39; AND IS_DELETED = \u0026#39;0\u0026#39; AND DATE_CREATED \u0026gt; now() - interval \u0026#39;31\u0026#39; day AND DATE_CREATED \u0026lt; now() ; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=266.09..266.10 rows=1 width=8) (actual time=0.565..0.566 rows=1 loops=1) Output: count(*) Buffers: shared hit=48 -\u0026gt; Append (cost=0.14..265.95 rows=56 width=0) (actual time=0.388..0.558 rows=1 loops=1) Buffers: shared hit=48 Subplans Removed: 37 -\u0026gt; Index Scan using idx_applnodeleted_25 on public.TABLE_RECORD_202305 TABLE_RECORD_1 (cost=0.43..30.39 rows=2 width=0) (actual time=0.059..0.059 rows=0 loops=1) Index Cond: (((TABLE_RECORD_1.appl_no)::text = \u0026#39;LZLMATH20230132302302\u0026#39;::text) AND ((TABLE_RECORD_1.is_deleted)::text = \u0026#39;0\u0026#39;::text)) Filter: ((TABLE_RECORD_1.date_created \u0026lt; now()) AND (TABLE_RECORD_1.date_created \u0026gt; (now() - \u0026#39;31 days\u0026#39;::interval day))) Buffers: shared hit=3 -\u0026gt; Index Scan using idx_applnodeleted_14 on public.TABLE_RECORD_202306 TABLE_RECORD_2 (cost=0.56..42.52 rows=17 width=0) 
(actual time=0.328..0.498 rows=1 loops=1) Index Cond: (((TABLE_RECORD_2.appl_no)::text = \u0026#39;LZLMATH20230132302302\u0026#39;::text) AND ((TABLE_RECORD_2.is_deleted)::text = \u0026#39;0\u0026#39;::text)) Filter: ((TABLE_RECORD_2.date_created \u0026lt; now()) AND (TABLE_RECORD_2.date_created \u0026gt; (now() - \u0026#39;31 days\u0026#39;::interval day))) Buffers: shared hit=45 Planning: Buffers: shared hit=5867 Planning Time: 17.195 ms Execution Time: 0.654 ms (18 rows) The SELECT only accessed the May and June partitions, indicating partition pruning worked correctly. Both partitions used the idx_applnodeleted index, so index selection was also correct.\nDirect execution of the SELECT statement returned results in milliseconds:\n## select count(*) from TABLE_RECORD WHERE APPL_NO = \u0026#39;LZLMATH20230132302302\u0026#39; AND IS_DELETED = \u0026#39;0\u0026#39; AND DATE_CREATED \u0026gt; now() - interval \u0026#39;31\u0026#39; day AND DATE_CREATED \u0026lt; now() ; count ------- 1 (1 row) Time: 4.946 ms At this point in the analysis, the execution plan appeared normal and execution time appeared normal.\nThe Business SQL Was Still Slow # However, slow SQL still appeared in the PostgreSQL logs — the UPDATE took 10 seconds:\n2023-06-29 11:06:45.077 CST,\u0026#34;lzldbopr\u0026#34;,\u0026#34;lzldb\u0026#34;,116286,\u0026#34;30.88.78.90:51871\u0026#34;,649cdebf.1c63e,7,\u0026#34;UPDATE\u0026#34;,2023-06-29 09:30:39 CST,759/12440291,4002354803,LOG,00000,\u0026#34;duration: 10287.105 ms \u0026#34; plan: Query Text: update TABLE_RECORD set IS_DELETED = \u0026#39;1\u0026#39;, DATE_UPDATED = LOCALTIMESTAMP(0) WHERE APPL_NO = $1 AND IS_DELETED = \u0026#39;0\u0026#39; AND DATE_CREATED \u0026gt; now() - interval \u0026#39;31\u0026#39; day AND DATE_CREATED \u0026lt; now() Update on TABLE_RECORD (cost=0.14..203.79 rows=39 width=2960) Update on TABLE_RECORD_202211 TABLE_RECORD_1 ... 
-\u0026gt; Index Scan using TABLE_RECORD_202304_date_created_idx on TABLE_RECORD_202304 TABLE_RECORD_6 (cost=0.44..5.47 rows=1 width=481) Index Cond: ((date_created \u0026gt; (now() - \u0026#39;31 days\u0026#39;::interval day)) AND (date_created \u0026lt; now())) Filter: (((appl_no)::text = \u0026#39;LZLMATH20230132302302\u0026#39;::text) AND ((is_deleted)::text = \u0026#39;0\u0026#39;::text)) -\u0026gt; Index Scan using TABLE_RECORD_202305_date_created_idx on TABLE_RECORD_202305 TABLE_RECORD_7 (cost=0.44..5.47 rows=1 width=483) Index Cond: ((date_created \u0026gt; (now() - \u0026#39;31 days\u0026#39;::interval day)) AND (date_created \u0026lt; now())) Filter: (((appl_no)::text = \u0026#39;LZLMATH20230132302302\u0026#39;::text) AND ((is_deleted)::text = \u0026#39;0\u0026#39;::text)) -\u0026gt; Index Scan using TABLE_RECORD_202306_date_created_idx on TABLE_RECORD_202306 TABLE_RECORD_8 (cost=0.44..5.47 rows=1 width=485) Index Cond: ((date_created \u0026gt; (now() - \u0026#39;31 days\u0026#39;::interval day)) AND (date_created \u0026lt; now())) Filter: (((appl_no)::text = \u0026#39;LZLMATH20230132302302\u0026#39;::text) AND ((is_deleted)::text = \u0026#39;0\u0026#39;::text)) -\u0026gt; Index Scan using idx_applnodeleted_38 on TABLE_RECORD_202307 TABLE_RECORD_9 (cost=0.14..5.17 rows=1 width=3502) Index Cond: (((appl_no)::text = \u0026#39;LZLMATH20230132302302\u0026#39;::text) AND ((is_deleted)::text = \u0026#39;0\u0026#39;::text)) Filter: ((date_created \u0026lt; now()) AND (date_created \u0026gt; (now() - \u0026#39;31 days\u0026#39;::interval day))) ... The May and June partitions were still using the date_created index on the partition key. 
The execution plan estimated only 1 row, but in reality these two partitions each had millions of rows.\nThis was very confusing: the optimizer was able to choose the better index, and our EXPLAIN showed it doing so, yet the business SQL was not using that index.\nUpdating Statistics # Since this was a PostgreSQL execution plan issue, our first thought was to refresh statistics.\nAfter the problem occurred, we collected statistics for both the parent partitioned table and the child partitions. Concerned that sessions might have cached the execution plan (plan_cache_mode=auto), we killed all sessions that had connected before the statistics collection.\nThe logs still showed the SQL taking 10 seconds, indicating that stale statistics were not the cause.\nAt this point the problem remained unsolved, and we seemed to have exhausted our options.\nRoot Cause # Throughout the analysis, the DBA\u0026rsquo;s EXPLAIN output had differed from the plan the application actually ran. The difference was that we had been running every EXPLAIN as the PostgreSQL superuser. 
We switched to the application user and ran EXPLAIN again — the execution plan matched what was in the logs!\nSince we had previously encountered issues with native partitioned table permissions causing abnormal execution plans, we immediately checked partition permissions.\nParent table permissions:\n## \\dp+ TABLE_RECORD Access privileges Schema | Name | Type | Access privileges | Column privileges | Policies --------+--------------------------+-------------------+-------------------------------------+-------------------+---------- public | TABLE_RECORD | partitioned table | lzldbdata=arwdDxt/lzldbdata +| | | | | r_lzldbdata_qry=r/lzldbdata +| | | | | r_lzldbdata_dml=arwd/lzldbdata +| | (1 row) Child partition permissions:\n## \\dp+ TABLE_RECORD_202505 Access privileges Schema | Name | Type | Access privileges | Column privileges | Policies --------+---------------------------------+-------+------------------------------------+-------------------+---------- public | TABLE_RECORD_202505 | table | lzldbdata=arwdDxt/lzldbdata +| | The partition permissions were missing the r_lzldbdata_dml role, which is granted to the business user.\nWe immediately granted the permissions, and the problem was resolved:\ngrant select,update,delete,insert on TABLE_RECORD_202305 to r_lzldbdata_dml; grant select,update,delete,insert on TABLE_RECORD_202306 to r_lzldbdata_dml; After switching to the opr user again and running EXPLAIN, the execution plan was correct — the May and June partitions used the proper index:\n\\c - lzldbopr\n-\u0026gt; Index Scan using TABLE_RECORD_202303_date_created_idx on TABLE_RECORD_202303 TABLE_RECORD_5 (cost=0.44..5.47 rows=1 width=482) Index Cond: ((date_created \u0026gt; (now() - \u0026#39;31 days\u0026#39;::interval day)) AND (date_created \u0026lt; now())) Filter: (((appl_no)::text = \u0026#39;LZLMATH20230132302302\u0026#39;::text) AND ((is_deleted)::text = \u0026#39;0\u0026#39;::text)) -\u0026gt; Index Scan using TABLE_RECORD_202304_date_created_idx 
on TABLE_RECORD_202304 TABLE_RECORD_6 (cost=0.44..5.47 rows=1 width=481) Index Cond: ((date_created \u0026gt; (now() - \u0026#39;31 days\u0026#39;::interval day)) AND (date_created \u0026lt; now())) Filter: (((appl_no)::text = \u0026#39;LZLMATH20230132302302\u0026#39;::text) AND ((is_deleted)::text = \u0026#39;0\u0026#39;::text)) -\u0026gt; Index Scan using idx_applnodeleted_25 on TABLE_RECORD_202305 TABLE_RECORD_7 (cost=0.43..30.39 rows=1 width=483) Index Cond: (((appl_no)::text = \u0026#39;LZLMATH20230132302302\u0026#39;::text) AND ((is_deleted)::text = \u0026#39;0\u0026#39;::text)) Filter: ((date_created \u0026lt; now()) AND (date_created \u0026gt; (now() - \u0026#39;31 days\u0026#39;::interval day))) -\u0026gt; Index Scan using idx_applnodeleted_14 on TABLE_RECORD_202306 TABLE_RECORD_8 (cost=0.56..42.57 rows=17 width=485) Index Cond: (((appl_no)::text = \u0026#39;LZLMATH20230132302302\u0026#39;::text) AND ((is_deleted)::text = \u0026#39;0\u0026#39;::text)) Filter: ((date_created \u0026lt; now()) AND (date_created \u0026gt; (now() - \u0026#39;31 days\u0026#39;::interval day))) -\u0026gt; Index Scan using idx_applnodeleted_38 on TABLE_RECORD_202307 TABLE_RECORD_9 (cost=0.14..5.17 rows=1 width=3502) Index Cond: (((appl_no)::text = \u0026#39;LZLMATH20230132302302\u0026#39;::text) AND ((is_deleted)::text = \u0026#39;0\u0026#39;::text)) Filter: ((date_created \u0026lt; now()) AND (date_created \u0026gt; (now() - \u0026#39;31 days\u0026#39;::interval day))) -\u0026gt; Index Scan using idx_applnodeleted_1 on TABLE_RECORD_202308 TABLE_RECORD_10 (cost=0.14..5.17 rows=1 width=3502) Index Cond: (((appl_no)::text = \u0026#39;LZLMATH20230132302302\u0026#39;::text) AND ((is_deleted)::text = \u0026#39;0\u0026#39;::text)) Filter: ((date_created \u0026lt; now()) AND (date_created \u0026gt; (now() - \u0026#39;31 days\u0026#39;::interval day))) No more slow UPDATE statements were observed in the PostgreSQL logs.\nTesting (Not Reproduced) # Initial table creation script:\n-- Switch 
to non-superuser \\c - lzldbdata -- create table CREATE TABLE PUBLIC.LZLPARTITION ( APPL_NO varchar(100) NULL, IS_DELETED varchar(8) NULL, DATE_CREATED timestamp NOT NULL DEFAULT now(), DATE_UPDATED timestamp NOT NULL DEFAULT now() ) PARTITION BY RANGE(DATE_CREATED); -- indexes create index DATE_LZLPARTITION on PUBLIC.LZLPARTITION (DATE_CREATED); create index NK_LZLPARTITION on PUBLIC.LZLPARTITION (APPL_NO); -- privs GRANT SELECT ON TABLE public.LZLPARTITION TO r_lzldbdata_qry; GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE public.LZLPARTITION TO r_lzldbdata_dml; -- partition create table LZLPARTITION_202301 partition of LZLPARTITION for values from (\u0026#39;2023-01-01 00:00:00\u0026#39;) to (\u0026#39;2023-02-01 00:00:00\u0026#39;); create table LZLPARTITION_202302 partition of LZLPARTITION for values from (\u0026#39;2023-02-01 00:00:00\u0026#39;) to (\u0026#39;2023-03-01 00:00:00\u0026#39;); create table LZLPARTITION_202303 partition of LZLPARTITION for values from (\u0026#39;2023-03-01 00:00:00\u0026#39;) to (\u0026#39;2023-04-01 00:00:00\u0026#39;); create table LZLPARTITION_202304 partition of LZLPARTITION for values from (\u0026#39;2023-04-01 00:00:00\u0026#39;) to (\u0026#39;2023-05-01 00:00:00\u0026#39;); create table LZLPARTITION_202305 partition of LZLPARTITION for values from (\u0026#39;2023-05-01 00:00:00\u0026#39;) to (\u0026#39;2023-06-01 00:00:00\u0026#39;); create table LZLPARTITION_202306 partition of LZLPARTITION for values from (\u0026#39;2023-06-01 00:00:00\u0026#39;) to (\u0026#39;2023-07-01 00:00:00\u0026#39;); Generate data:\ninsert into public.LZLPARTITION select n + 10, \u0026#39;N\u0026#39;, to_char(to_date(\u0026#39;2023-01-01\u0026#39;, \u0026#39;YYYY-MM-DD\u0026#39;) + (\u0026#39;\u0026#39; || n || \u0026#39; minute\u0026#39;) ::interval, \u0026#39;YYYY-MM-DD\u0026#39;)::\u0026#34;date\u0026#34;, now() from generate_series(0, 300000) n Data distribution:\nselect count(*),tableoid::regclass from lzlpartition group by 2; count | tableoid 
-------+--------------------- 44640 | lzlpartition_202301 40320 | lzlpartition_202302 44640 | lzlpartition_202303 43200 | lzlpartition_202304 44640 | lzlpartition_202305 43200 | lzlpartition_202306 39361 | lzlpartition_202307 Permissions not inherited:\n## \\dp+ lzlpartition Access privileges Schema | Name | Type | Access privileges | Column privileges | Policies --------+--------------+-------------------+-------------------------------------+-------------------+---------- public | lzlpartition | partitioned table | lzldbdata=arwdDxt/lzldbdata +| | | | | r_lzldbdata_qry=r/lzldbdata +| | | | | r_lzldbdata_dml=arwd/lzldbdata | | ## \\dp+ lzlpartition_202306 Access privileges Schema | Name | Type | Access privileges | Column privileges | Policies --------+---------------------+-------+-------------------+-------------------+---------- public | lzlpartition_202306 | table | | | Execution plan (correct):\nexplain select count(*) from lzlpartition WHERE APPL_NO = \u0026#39;217450\u0026#39; AND IS_DELETED = \u0026#39;N\u0026#39; AND DATE_CREATED \u0026gt; now() - interval \u0026#39;31\u0026#39; day AND DATE_CREATED \u0026lt; now(); QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=36.76..36.77 rows=1 width=8) -\u0026gt; Append (cost=0.29..36.74 rows=7 width=0) Subplans Removed: 5 -\u0026gt; Index Scan using lzlpartition_202305_appl_no_idx on lzlpartition_202305 lzlpartition_1 (cost=0.15..5.19 rows=1 width=0) Index Cond: ((appl_no)::text = \u0026#39;217450\u0026#39;::text) Filter: (((is_deleted)::text = \u0026#39;0\u0026#39;::text) AND (date_created \u0026lt; now()) AND (date_created \u0026gt; (now() - \u0026#39;31 days\u0026#39;::interval day))) -\u0026gt; Index Scan using lzlpartition_202306_appl_no_idx on lzlpartition_202306 lzlpartition_2 (cost=0.15..5.19 rows=1 width=0) Index Cond: ((appl_no)::text = \u0026#39;217450\u0026#39;::text) Filter: 
(((is_deleted)::text = \u0026#39;0\u0026#39;::text) AND (date_created \u0026lt; now()) AND (date_created \u0026gt; (now() - \u0026#39;31 days\u0026#39;::interval day))) The permissions were still not inherited. We tested other PostgreSQL versions and saw the same thing, so this appears to be general behavior.\nEven so, we couldn\u0026rsquo;t reproduce the issue: the test used the correct index, unlike production, which used the wrong one.\nSummary # Since we had collected statistics and killed sessions, a cached execution plan is unlikely to have been the cause. After executing GRANT, the partition execution plan immediately became correct (granting on just one partition fixed that partition), so we are fairly confident that the partition permission issue caused the abnormal partition execution plan.\nThe analysis and resolution process can be summarized as follows:\nSwitch to the application user to view the execution plan. Viewing execution plans as the superuser is common practice, but the plan the superuser sees may not be the plan the application gets. Permissions on child partitions of partitioned tables. The root cause is that permissions on the child partitions were inconsistent with those on the parent table, which made the execution plan abnormal. In other words, permission issues affected PostgreSQL\u0026rsquo;s execution plan. This issue is difficult to reproduce and occurs very rarely. Execution plan anomalies caused by permissions are extremely subtle and hard to diagnose. Two questions worth deeper discussion:\nPermission issues shouldn\u0026rsquo;t affect execution plans. Why do permissions affect execution plans? Child partition permissions are inconsistent with parent table permissions. Why don\u0026rsquo;t child partitions fully inherit parent table permissions? 
A bug report has been submitted to see what the official team says.\n","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/2024/08/12/incorrect-execution-plan-caused-by-partition-permission-issues/","section":"Posts","summary":"Problem Overview # Last night, the business team updated a SQL query. Previously, the query ran very fast without the DATE_CREATED field (the partition key). After the release, the partition field was added to reduce the number of partitions accessed. However, after adding it, the UPDATE execution actually became slower.\nBefore:\nupdate TABLE_RECORD set IS_DELETED = '1', DATE_UPDATED = LOCALTIMESTAMP(0) WHERE APPL_NO = $1 AND IS_DELETED = '0' After:\nupdate TABLE_RECORD set IS_DELETED = '1', DATE_UPDATED = LOCALTIMESTAMP(0) WHERE APPL_NO = $1 AND IS_DELETED = '0' AND DATE_CREATED \u003e now() - interval '31' day AND DATE_CREATED \u003c now() Before the release, access time was in milliseconds. After the release, access time was 10 seconds. The SQL runs frequently, and the business found this unacceptable.\n","title":"Incorrect Execution Plan Caused by Partition Permission Issues","type":"posts"},{"content":" As a DBA # Since early 2023, I set my main task for the year — learn the PostgreSQL database. Though I didn\u0026rsquo;t set detailed plans, the overall goal was to finish learning some foundational PostgreSQL knowledge. Later I found I had oversimplified things — the cost of learning PostgreSQL was far greater than I imagined, and I didn\u0026rsquo;t achieve this goal in 2023. For example, the PostgreSQL transaction chapter: I thought I could finish it in 2 weeks, but it took me about 2 months. Regardless, persistent learning did yield some results:\nAmong them, the optimizer chapter was actually not completed. Though I\u0026rsquo;m guilty, I still need to explain. The optimization chapter has been in progress for over two months — not because I was slacking off, but because it\u0026rsquo;s simply impossible to finish. 
It has already reached Typora\u0026rsquo;s text limit (at around 8000 characters it starts lagging), so I was forced to split it into parts. It\u0026rsquo;s already split to Part 4:\nEven so, the optimization chapter is probably less than half done. I can only shamelessly carry it over to the next year\u0026hellip; Personally, I think another 4 months should let me complete the optimization chapter\u0026hellip; Even then, the priority needs to be pushed back. There\u0026rsquo;s really not enough time!\nREADING # My main profession is databases, so I should spend time on databases, and extracurricular reading should take a back seat. However, I still don\u0026rsquo;t want to give up this part, for three reasons:\nThe value brought by reading is immeasurable in the short term Reading brings a pleasant sense of intellectual enrichment I use fragmented time to read, only spending 2-3 hours writing reading notes, which doesn\u0026rsquo;t take up too much study time I\u0026rsquo;ve certainly read some PostgreSQL technical books, but I read them with a targeted approach. For example, for optimization, I\u0026rsquo;d bring together \u0026ldquo;The Internals of PostgreSQL,\u0026rdquo; \u0026ldquo;PostgreSQL Technical Internals: Query Optimization Deep Dive,\u0026rdquo; \u0026ldquo;PostgreSQL Query Engine Source Code Technical Analysis,\u0026rdquo; and \u0026ldquo;The Art of Database Query Optimizer\u0026rdquo; to study a particular knowledge point together. I wasn\u0026rsquo;t focused on whether I\u0026rsquo;d finish them, and I didn\u0026rsquo;t read them cover-to-cover in order. 
So the reading list here only covers extracurricular books.\n2023 Extracurricular Reading List (ranked by preference):\n\u0026ldquo;Homo Deus\u0026rdquo; \u0026ldquo;Romance of the Three Kingdoms\u0026rdquo; The \u0026ldquo;Space Odyssey\u0026rdquo; series: 2001, 2010, 2061, 3001 \u0026ldquo;Elon Musk\u0026rdquo; \u0026ldquo;Chimpanzee Politics\u0026rdquo; \u0026ldquo;Goodbye, the Age of Mediocrity\u0026rdquo; \u0026ldquo;Wild\u0026rdquo; \u0026ldquo;Are We Smart Enough to Know How Smart Animals Are?\u0026rdquo; \u0026ldquo;To Kill a Mockingbird\u0026rdquo; \u0026ldquo;Rich Dad Poor Dad\u0026rdquo; \u0026ldquo;When Breath Becomes Air\u0026rdquo; \u0026ldquo;The Metamorphosis,\u0026rdquo; \u0026ldquo;The Judgment,\u0026rdquo; \u0026ldquo;A Hunger Artist\u0026rdquo; and other Kafka short stories Not great: \u0026ldquo;What Life Could Mean to You,\u0026rdquo; \u0026ldquo;How to Win Friends and Influence People,\u0026rdquo; \u0026ldquo;The Courage to Be Disliked\u0026rdquo;\nBlog and WeChat Official Account # I publish articles through two channels:\nCSDN Blog: https://liuzhilong.blog.csdn.net WeChat Official Account: liuzhilong62 I\u0026rsquo;ve kept up blogging for many years. The big change in 2023 was mainly writing about PostgreSQL and increasing technical depth. The WeChat Official Account is a new venture I started this year, and it was a major experiment in 2023. Both blogs and official accounts can be used for technical sharing, but their audiences are somewhat different. A blog works as a long-term technical archive, while an official account is more like a technical news feed. There are many big names in the community who publish daily (even multiple times a day), and I greatly admire that. But there are also big names who focus on quality articles without worrying about daily posting. I personally prefer the latter approach: learning a domain\u0026rsquo;s knowledge roughly in one go, which feels more holistic and targeted. 
Often I split longer articles into parts for the official account (I don\u0026rsquo;t even like reading overly long articles myself). On my blog I don\u0026rsquo;t split them, so readers interested in a particular article can search for it on CSDN — it\u0026rsquo;s easier to read there.\nWhy write?\nSelf-learning value Technical research value Dissemination value The efficiency of active learning far exceeds passive learning, just like this learning pyramid (image from \u0026ldquo;Rich Dad Poor Dad\u0026rdquo; — the value of extracurricular reading!):\nOpportunities like hands-on practice and presentations are rare and hard to come by. Outputting what you\u0026rsquo;ve learned as articles greatly improves your understanding of knowledge points. Reading an article might take just ten minutes, but producing it as an article may take more than ten times that long.\nThis year I also tried doing pure translation-style technical articles. Although the technical research value isn\u0026rsquo;t high, there\u0026rsquo;s still learning value and dissemination value. Reading something once versus translating it once leads to different levels of understanding, just like what I said above: active learning. However, what bothers me a bit now is: previously, for things I couldn\u0026rsquo;t understand, I\u0026rsquo;d use Google Translate for a rough pass and then polish it myself. Now with GPT, it can translate an entire article and I barely need to change any words or sentences. The active learning value has been severely diluted — the AI is doing all the learning\u0026hellip;\nMy writing style changed significantly in 2023. I wrote about various things and tried everything. Of course, I know one should focus on vertical content, but I still couldn\u0026rsquo;t resist doing random things — I haven\u0026rsquo;t even settled on a name for my official account yet. 
Currently, what\u0026rsquo;s clear is: technical articles and extracurricular reading notes, with technical articles as the main focus. Other types of articles probably won\u0026rsquo;t be written anymore. Whether I\u0026rsquo;ll adjust later, I don\u0026rsquo;t know. At least the official account still has room for adjustment. Anyway, let\u0026rsquo;s keep it like this — launch first, adjust later.\n2023 blog statistics are hard to track now. I can only provide blog data from 2017 to 2023 as a snapshot.\nCSDN Blog:\nWeChat Official Account followers:\nFinal Thoughts # The biggest realization of 2023 — time. There\u0026rsquo;s really not enough time!\nOn June 17, 2023, I participated in the PostgreSQL Database Technology Summit Chengdu stop and shared my fresh, hot-off-the-press PostgreSQL transaction knowledge with the experts. It was my first time on stage and I was quite nervous. I must thank Boss Can for the opportunity. There was a small episode during this sharing that shows how pressed for time I was in 2023. I also had part-time graduate studies — the day of the sharing was also my final exam day. After finishing my talk, I rushed straight to the airport\u0026hellip; In the end, I missed 3 exams and had to retake them\u0026hellip; It was too hard.\nI\u0026rsquo;ve completely given up on work-life balance — having a work-learning balance would be good enough. Every day after work I don\u0026rsquo;t think about resting but about going home to study. In the end, there were still many things unfinished, left to my 2024 self.\nExpectations for 2024:\nComplete my thesis and graduate smoothly Finish the PostgreSQL optimization section We\u0026rsquo;ll see about the rest ","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/2024/08/12/my-2023-year-end-summary/","section":"Posts","summary":"As a DBA # Since early 2023, I set my main task for the year — learn the PostgreSQL database. 
Though I didn’t set detailed plans, the overall goal was to finish learning some foundational PostgreSQL knowledge. Later I found I had oversimplified things — the cost of learning PostgreSQL was far greater than I imagined, and I didn’t achieve this goal in 2023. For example, the PostgreSQL transaction chapter: I thought I could finish it in 2 weeks, but it took me about 2 months. Regardless, persistent learning did yield some results:\n","title":"My 2023 Year-End Summary","type":"posts"},{"content":" Problem Analysis # When executing SQL in a PostgreSQL database, ORDER BY LIMIT 10 runs slower than ORDER BY LIMIT 100.\nExecution Plan Analysis # SELECT *, (select cl.ITEM_DESC from tablelzl2 cl where item_name=\u0026#39;name\u0026#39; and cl.ITEM_NO=\u0026#39;abcdefg\u0026#39;) AS \u0026#34;item\u0026#34; FROM tablelzl1 RI WHERE RI.column1=\u0026#39;AAAA\u0026#39; AND RI.column2 = \u0026#39;applyno20231112\u0026#39; ORDER BY RI.column3 DESC limit 10 Limit (cost=0.43..1522.66 rows=10 width=990) -\u0026gt; Index Scan Backward using idx_tablelzl1_column3 on tablelzl1 ri (cost=0.43..158007.45 rows=1038 width=990) Filter: (((column1)::text = \u0026#39;AAAA\u0026#39;::text) AND ((column2)::text = \u0026#39;applyno20231112\u0026#39;::text)) SubPlan 1 -\u0026gt; Index Scan using uk_tablelzl2_ii on tablelzl2 cl (cost=0.27..5.29 rows=1 width=18) Index Cond: (((item_no)::text = \u0026#39;manualSign\u0026#39;::text) AND ((item_name)::text = (ri.manual_sign)::text)) The main table does not use the column2 index. Instead it uses an Index Scan Backward on the column3 sort index. The scan cost for the index is very high, yet the final cost looks low. 
Actual execution takes 9 seconds.\nChanging LIMIT 10 to LIMIT 100 yields a normal execution plan:\nSELECT *, (select cl.ITEM_DESC from tablelzl2 cl where cl.ITEM_NAME = RI.MANUAL_SIGN AND cl.ITEM_NO=\u0026#39;manualSign\u0026#39;) AS \u0026#34;manualSign\u0026#34; FROM tablelzl1 RI WHERE RI.column1=\u0026#39;AAAA\u0026#39; AND RI.column2 = \u0026#39;applyno20231112\u0026#39; ORDER BY RI.column3 DESC limit 100 QUERY PLAN ----------------------------------------------------------------------------------------------------------------------- Limit (cost=2632.28..3162.78 rows=100 width=990) -\u0026gt; Result (cost=2632.28..8138.87 rows=1038 width=990) -\u0026gt; Sort (cost=2632.28..2634.87 rows=1038 width=474) Sort Key: ri.column3 DESC -\u0026gt; Index Scan using idx_cri_column2 on tablelzl1 ri (cost=0.43..2592.61 rows=1038 width=474) Index Cond: ((column2)::text = \u0026#39;applyno20231112\u0026#39;::text) Filter: ((column1)::text = \u0026#39;AAAA\u0026#39;::text) SubPlan 1 -\u0026gt; Index Scan using uk_tablelzl2_ii on tablelzl2 cl (cost=0.27..5.29 rows=1 width=18) Index Cond: (((item_no)::text = \u0026#39;manualSign\u0026#39;::text) AND ((item_name)::text = (ri.manual_sign)::text)) (10 rows) The subquery plan remains unchanged. The main table now uses the column2 single-column index, fetches rows, sorts, then applies LIMIT — execution is extremely fast.\nThis is not just about LIMIT values — changing only the column2 value in the original SQL can also produce a normal plan. In practice, only a few specific column2 values trigger the abnormal plan.\nExecution plan comparison:\ncolumn2 is a filter column, column3 is a sort column. The two plans choose different indexes:\nAbnormal LIMIT 10 plan: Backward scan sort-column index → fetch rows → limit. No extra sort needed; scanning backward, it can stop as soon as it finds enough rows matching the LIMIT. The estimated cost of scanning the sort-column index is very high, but the top-level LIMIT cost estimate is very low. 
Normal LIMIT 100 plan: Access filter-column index → fetch rows → sort by sort column → limit. Because sorting is required, all matching index entries must be retrieved. The filter-column index scan itself has a low cost estimate. So the key issue is: the optimizer underestimates the cost of a partial backward scan on the sort index.\nActual Execution # Let\u0026rsquo;s look at explain (analyze,buffers):\n--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=0.43..1521.93 rows=10 width=990) (actual time=23.311..8122.516 rows=10 loops=1) Buffers: shared hit=861100 read=42985 dirtied=7 I/O Timings: read=6741.003 -\u0026gt; Index Scan Backward using idx_tablelzl1_column3 on tablelzl1 ri (cost=0.43..157932.45 rows=1038 width=990) (actual time=23.309..8122.505 rows=10 loops=1) Filter: (((column1)::text = \u0026#39;AAAA\u0026#39;::text) AND ((column2)::text = \u0026#39;applyno20231112\u0026#39;::text)) Rows Removed by Filter: 1521796 Buffers: shared hit=861100 read=42985 dirtied=7 I/O Timings: read=6741.003 SubPlan 1 -\u0026gt; Index Scan using uk_tablelzl2_ii on tablelzl2 cl (cost=0.27..5.29 rows=1 width=18) (actual time=0.005..0.005 rows=0 loops=10) Index Cond: (((item_no)::text = \u0026#39;manualSign\u0026#39;::text) AND ((item_name)::text = (ri.manual_sign)::text)) Buffers: shared hit=6 Planning: Buffers: shared hit=121 read=28 I/O Timings: read=1.476 Planning Time: 2.314 ms Execution Time: 8122.658 ms Limit (cost=2632.28..3162.78 rows=100 width=990) (actual time=150.101..150.122 rows=14 loops=1) Buffers: shared hit=700 read=274 I/O Timings: read=146.903 -\u0026gt; Result (cost=2632.28..8138.87 rows=1038 width=990) (actual time=150.100..150.119 rows=14 loops=1) Buffers: shared hit=700 read=274 I/O Timings: read=146.903 -\u0026gt; Sort (cost=2632.28..2634.87 rows=1038 width=474) (actual time=150.072..150.073 rows=14 loops=1) Sort 
Key: ri.column3 DESC Sort Method: quicksort Memory: 30kB Buffers: shared hit=694 read=274 I/O Timings: read=146.903 -\u0026gt; Index Scan using idx_cri_column2 on tablelzl1 ri (cost=0.43..2592.61 rows=1038 width=474) (actual time=0.418..149.973 rows=14 loops=1) Index Cond: ((column2)::text = \u0026#39;applyno20231112\u0026#39;::text) Filter: ((column1)::text = \u0026#39;AAAA\u0026#39;::text) Rows Removed by Filter: 1218 Buffers: shared hit=691 read=274 I/O Timings: read=146.903 SubPlan 1 -\u0026gt; Index Scan using uk_tablelzl2_ii on tablelzl2 cl (cost=0.27..5.29 rows=1 width=18) (actual time=0.002..0.002 rows=0 loops=14) Index Cond: (((item_no)::text = \u0026#39;manualSign\u0026#39;::text) AND ((item_name)::text = (ri.manual_sign)::text)) Buffers: shared hit=6 Planning Time: 0.334 ms Execution Time: 150.257 ms The LIMIT 10 plan executes in 8 seconds: shared hit=861,100, disk read=42,985, 1,521,796 rows removed by filter.\nThe LIMIT 100 plan executes in 0.15 seconds: shared hit=694, read=274, 1,218 rows removed.\nThe LIMIT 10 plan is clearly abnormal — it reads far too many rows before finding qualifying ones, which is why the query is slow.\nStatistics Analysis # The estimated cost is low, but the actual scan touches many index rows. 
First, check whether the statistics are accurate.\nTable statistics:\n[postgres@cnsz381785:7169/(rasesql)phmamp][10-30.15:01:26]M=# select relpages,reltuples::bigint from pg_class where relname=\u0026#39;tablelzl1\u0026#39;; relpages | reltuples ----------+----------- 91172 | 2280874 -- roughly matches actual count Column statistics:\n[phmampopr@cnsz381785:7169/(rasesql)phmamp][10-27.17:08:48]M=\u0026gt; select * from pg_stats where tablename=\u0026#39;tablelzl1\u0026#39; and attname=\u0026#39;column2\u0026#39;; -[ RECORD 1 ]----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- schemaname | public tablename | tablelzl1 attname | column2 inherited | f null_frac | 0 avg_width | 18 n_distinct | -0.11990886 most_common_vals | {applyno20231112,DY20190723006650,DY20200102012899,DY20180827000557,DY20190524001304,DY20190529001885,DY20190728002359} most_common_freqs | {0.0005,0.00026666667,0.00023333334,0.0002,0.0002,0.0002,0.0002} histogram_bounds | {CULZF0000121605605,DSNEW0000126854232,DSNEW0000137652871,DY20160516001057,DY20161104005509,DY20170306002677,DY20170703010428,DY20170928013517,DY20180410007383,DY20180615002936,DY20180 correlation | 0.3131596 most_common_elems | [null] most_common_elem_freqs | [null] elem_count_histogram | [null] The value applyno20231112 happens to be the top most_common_vals, with an estimated frequency of 0.0005. Multiplying: 2,280,874 × 0.0005 = 1,140, which is close to the real count of 1,232.\n[postgres@cnsz381785:7169/(rasesql)phmamp][10-30.15:05:28]M=# select count(*) from tablelzl1 where column2 = \u0026#39;applyno20231112\u0026#39;; count ------- 1232 Statistics are accurate. Running ANALYZE to recollect statistics would not fix this.\nThe Effect of Uneven Data Distribution # Using the current statistics, the estimated number of matching rows is ~1,140. 
On average, finding the first matching row through the sort-column index would require scanning 2,280,874 / 1,140 ≈ 2,000 index entries. For 10 rows, about 20,000 entries; for 100 rows, about 200,000 entries.\nLet\u0026rsquo;s disable sort and force the LIMIT 100 statement to use the sort-column index:\nM=# set enable_sort=off; SET --limit 100 execution plan Limit (cost=0.43..15222.69 rows=100 width=990) -\u0026gt; Index Scan Backward using idx_tablelzl1_column3 on tablelzl1 ri (cost=0.43..158007.45 rows=1038 width=990) Filter: (((column1)::text = \u0026#39;AAAA\u0026#39;::text) AND ((column2)::text = \u0026#39;applyno20231112\u0026#39;::text)) SubPlan 1 -\u0026gt; Index Scan using uk_tablelzl2_ii on tablelzl2 cl (cost=0.27..5.29 rows=1 width=18) Index Cond: (((item_no)::text = \u0026#39;manualSign\u0026#39;::text) AND ((item_name)::text = (ri.manual_sign)::text)) When LIMIT 10 becomes LIMIT 100, the Limit node\u0026rsquo;s cost jumps from 1522.66 to 15222.69, a tenfold increase. The planner scales the index scan\u0026rsquo;s total cost of 158007.45 by the fraction of estimated rows the LIMIT needs: 10/1038 gives about 1522, and 100/1038 gives about 15222. The LIMIT 100 cost of 15222.69 now exceeds the filter-column index plan\u0026rsquo;s cost of 3162.78, so the optimizer switches indexes.\nThe above estimates all assume data is evenly scattered across the sort-column index. 
In reality, the matching rows could sit at the very end of the index, where a backward scan finds them quickly, or be concentrated in the first few leaf pages, where the backward scan has to walk nearly the whole index and fetch a heap row for each entry, making the cost extremely high.\nThe correlation between the two columns, that is, where the rows matching the filter fall within the sort-column index, determines whether using the sort-column index is efficient.\nLet\u0026rsquo;s look at how many rows were actually scanned:\n-\u0026gt; Index Scan Backward using idx_tablelzl1_column3 on tablelzl1 ri (cost=0.43..157932.45 rows=1038 width=990) (actual time=23.309..8122.505 rows=10 loops=1) Filter: (((column1)::text = \u0026#39;AAAA\u0026#39;::text) AND ((column2)::text = \u0026#39;applyno20231112\u0026#39;::text)) Rows Removed by Filter: 1521796 The scan actually discarded 1,521,796 rows to find just 10 matches, against an estimate of about 20,000: a 76× discrepancy!\nTrigger Conditions # Must involve WHERE + ORDER BY + LIMIT clauses Both the sort column and filter column must have indexes The LIMIT value is typically not very large Uneven data distribution Solution # Rewrite the SQL: add an expression to prevent the ORDER BY column from using its index.\nSELECT *, (select cl.ITEM_DESC from tablelzl2 cl where cl.ITEM_NAME = RI.MANUAL_SIGN AND cl.ITEM_NO=\u0026#39;manualSign\u0026#39;) AS \u0026#34;manualSign\u0026#34; FROM tablelzl1 RI WHERE RI.column1=\u0026#39;AAAA\u0026#39; AND RI.column2 = \u0026#39;applyno20231112\u0026#39; ORDER BY RI.column3 +\u0026#39;0\u0026#39; DESC limit 10 How Oracle Handles This # Cost Estimation Differences in Execution Plans # From the analysis above, the PostgreSQL execution plan\u0026rsquo;s cost looks unbalanced: the upper-level cost is lower than the inner-level cost, unlike Oracle\u0026rsquo;s hierarchical accumulation.\nLet\u0026rsquo;s run an experiment: a table containing only rows where col1='x', comparing how PostgreSQL and Oracle calculate costs:\n[postgres@cnsz381785:7169/(rasesql)dbmgr][10-31.14:32:19]M=# explain select * from 
testlzl where col1=\u0026#39;x\u0026#39; limit 1; QUERY PLAN ----------------------------------------------------------------------- Limit (cost=0.00..0.02 rows=1 width=2) -\u0026gt; Seq Scan on testlzl (cost=0.00..17747.20 rows=1048576 width=2) Filter: ((col1)::text = \u0026#39;x\u0026#39;::text) [postgres@cnsz381785:7169/(rasesql)dbmgr][10-31.14:32:30]M=# explain select * from testlzl where col1=\u0026#39;xx\u0026#39; limit 1; QUERY PLAN ----------------------------------------------------------------- Limit (cost=0.00..17747.20 rows=1 width=2) -\u0026gt; Seq Scan on testlzl (cost=0.00..17747.20 rows=1 width=2) Filter: ((col1)::text = \u0026#39;xx\u0026#39;::text) When col1='x', the row is found immediately and the Limit node\u0026rsquo;s cost is only 0.02, but the Seq Scan node beneath it still shows 17747.20 with rows=1048576, the same as scanning the whole table. PostgreSQL applies the LIMIT at the Limit node and leaves the child node\u0026rsquo;s displayed cost and row estimate untouched.\nNow let\u0026rsquo;s see how Oracle handles the same case:\nSYS@t8icss1\u0026gt; select * from dbmgr.testlzl where a=\u0026#39;x\u0026#39; and rownum\u0026lt;=1; 1 row selected. 
Execution Plan ---------------------------------------------------------- Plan hash value: 2045386539 ------------------------------------------------------------------------------ | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ------------------------------------------------------------------------------ | 0 | SELECT STATEMENT | | 1 | 2 | 2 (0)| 00:00:01 | |* 1 | COUNT STOPKEY | | | | | | |* 2 | TABLE ACCESS FULL| TESTLZL | 1 | 2 | 2 (0)| 00:00:01 | ------------------------------------------------------------------------------ Predicate Information (identified by operation id): --------------------------------------------------- 1 - filter(ROWNUM\u0026lt;=1) 2 - filter(\u0026#34;A\u0026#34;=\u0026#39;x\u0026#39;) SYS@t8icss1\u0026gt; select * from dbmgr.testlzl where a=\u0026#39;xx\u0026#39; and rownum\u0026lt;=1; no rows selected Execution Plan ---------------------------------------------------------- Plan hash value: 2045386539 ------------------------------------------------------------------------------ | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ------------------------------------------------------------------------------ | 0 | SELECT STATEMENT | | 1 | 2 | 302 (2)| 00:00:01 | |* 1 | COUNT STOPKEY | | | | | | |* 2 | TABLE ACCESS FULL| TESTLZL | 1 | 2 | 302 (2)| 00:00:01 | ------------------------------------------------------------------------------ Predicate Information (identified by operation id): --------------------------------------------------- 1 - filter(ROWNUM\u0026lt;=1) 2 - filter(\u0026#34;A\u0026#34;=\u0026#39;xx\u0026#39;) In Oracle, when a='x' is found immediately, the STOPKEY cost is pushed into the inner node — cost is only 2. When the data doesn\u0026rsquo;t exist (a='xx'), the full scan cost is 302.\nThis is an important difference between Oracle and PostgreSQL cost calculation:\nIn Oracle, the outer node cost is always ≥ the inner node cost; in PostgreSQL, this is not guaranteed. 
Oracle\u0026rsquo;s inner node cost incorporates outer operators (e.g., STOPKEY); PostgreSQL does not — it gives the full cost of the child path. Oracle and Uneven Data Distribution # Knowing the principle, we can reproduce the issue by placing data at the beginning of the sort index:\ncreate table tlzl(a char(100) not null,b char(100) not null); --Insert bulk data begin for i in 1..100000 loop insert into tlzl values(\u0026#39;test\u0026#39;,\u0026#39;test\u0026#39;); end loop; end; / --Insert special data insert into tlzl values(\u0026#39;aaaa\u0026#39;,\u0026#39;aaaa\u0026#39;); insert into tlzl values(\u0026#39;zzzz\u0026#39;,\u0026#39;zzzz\u0026#39;); --Create indexes create index idx_a on tlzl(a); create index idx_b on tlzl(b); --Collect statistics EXEC DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=\u0026gt;\u0026#39;SYS\u0026#39;,TABNAME=\u0026gt;\u0026#39;TLZL\u0026#39;,estimate_percent =\u0026gt; 10, degree=\u0026gt;1,METHOD_OPT=\u0026gt;\u0026#39;FOR ALL COLUMNS SIZE AUTO\u0026#39;,cascade=\u0026gt;true); select * from (select /*+ index(tlzl idx_a)*/* from tlzl where b=\u0026#39;aaaa\u0026#39; order by a) where rownum\u0026lt;=1; select * from (select /*+ index(tlzl idx_a)*/* from tlzl where b=\u0026#39;zzzz\u0026#39; order by a) where rownum\u0026lt;=1; SYS@t8icss1\u0026gt; select * from (select /*+ index(tlzl idx_a)*/* from tlzl where b=\u0026#39;aaaa\u0026#39; order by a) where rownum\u0026lt;=1; Execution Plan ---------------------------------------------------------- Plan hash value: 3674066029 --------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | --------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 204 | 2210 (1)| 00:00:01 | |* 1 | COUNT STOPKEY | | | | | | | 2 | VIEW | | 1 | 204 | 2210 (1)| 00:00:01 | |* 3 | TABLE ACCESS BY INDEX ROWID| TLZL | 1 | 202 | 2210 (1)| 00:00:01 | | 4 | INDEX FULL SCAN | 
IDX_A | 98830 | | 779 (1)| 00:00:01 | --------------------------------------------------------------------------------------- SYS@t8icss1\u0026gt; select * from (select /*+ index(tlzl idx_a)*/* from tlzl where b=\u0026#39;zzzz\u0026#39; order by a) where rownum\u0026lt;=1; Execution Plan ---------------------------------------------------------- Plan hash value: 3674066029 --------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | --------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 204 | 2210 (1)| 00:00:01 | |* 1 | COUNT STOPKEY | | | | | | | 2 | VIEW | | 1 | 204 | 2210 (1)| 00:00:01 | |* 3 | TABLE ACCESS BY INDEX ROWID| TLZL | 1 | 202 | 2210 (1)| 00:00:01 | | 4 | INDEX FULL SCAN | IDX_A | 98830 | | 779 (1)| 00:00:01 | --------------------------------------------------------------------------------------- Oracle\u0026rsquo;s optimizer has the same limitation — it doesn\u0026rsquo;t know where the data actually sits within the index. 
Whether the data is at the first or last position in the index, the estimated cost is the same.\nHowever, Oracle provides more tools to address this: extended statistics, Automatic Column Group Detection, plan baselines, etc.\nReferences # http://www.postgres.cn/v2/news/viewone/1/717 https://oracle-base.com/articles/12c/automatic-column-group-detection-extended-statistics-12cr1\n","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/2024/08/12/order-by-limit-10-slower-than-order-by-limit-100/","section":"Posts","summary":"Problem Analysis # When executing SQL in a PostgreSQL database, ORDER BY LIMIT 10 runs slower than ORDER BY LIMIT 100.\nExecution Plan Analysis # SELECT *, (select cl.ITEM_DESC from tablelzl2 cl where item_name='name' and cl.ITEM_NO='abcdefg') AS \"item\" FROM tablelzl1 RI WHERE RI.column1='AAAA' AND RI.column2 = 'applyno20231112' ORDER BY RI.column3 DESC limit 10 Limit (cost=0.43..1522.66 rows=10 width=990) -\u003e Index Scan Backward using idx_tablelzl1_column3 on tablelzl1 ri (cost=0.43..158007.45 rows=1038 width=990) Filter: (((column1)::text = 'AAAA'::text) AND ((column2)::text = 'applyno20231112'::text)) SubPlan 1 -\u003e Index Scan using uk_tablelzl2_ii on tablelzl2 cl (cost=0.27..5.29 rows=1 width=18) Index Cond: (((item_no)::text = 'manualSign'::text) AND ((item_name)::text = (ri.manual_sign)::text)) The main table does not use the column2 index. Instead it uses an Index Scan Backward on the column3 sort index. The scan cost for the index is very high, yet the final cost looks low. Actual execution takes 9 seconds.\n","title":"ORDER BY LIMIT 10 Slower Than ORDER BY LIMIT 100","type":"posts"},{"content":"​ Vacation # I took a long vacation and went back to my hometown before my leave days expired — not just to escape the busyness of work, but also to visit my grandparents. For working people like us, going back to our hometown is really difficult. 
If it\u0026rsquo;s just a weekend trip, we\u0026rsquo;d only get one day of rest before having to head back — too exhausting. We don\u0026rsquo;t get many vacation days to begin with, and when we do, most people think about driving out to see some scenery or just staying home for a few days doing nothing. No one usually thinks of using their precious leave to visit elderly relatives back home.\nIronically, the leave I used to visit my grandparents was childcare leave, not some kind of \u0026ldquo;eldercare leave.\u0026rdquo; It seems the world doesn\u0026rsquo;t have such a thing as \u0026ldquo;eldercare leave\u0026rdquo; — only family visit leave. Although there is legally a \u0026ldquo;family visit leave\u0026rdquo; provision, never mind that it isn\u0026rsquo;t specifically designed for visiting the elderly — just look at those impossibly long qualifiers. For the vast majority of people, family visit leave essentially doesn\u0026rsquo;t exist.\nUsing childcare leave not to care for children but to visit the elderly — I imagine most people wouldn\u0026rsquo;t do that. Am I the only oddball who would? Well, at least this is how I see it: raising children and caring for the elderly are equally important; we shouldn\u0026rsquo;t favor one over the other. Society and working people tend to prioritize the former. Regardless, I still wanted to go back and spend time with them, to see what the old couple does every day, how they live, whether they face any difficulties, and how they cope with those difficulties. So I went back, alone.\nThe end of the road. The place where the old couple lives is where I grew up. It\u0026rsquo;s quite hidden — you have to turn off the main road onto a mountain path and go a long way, all the way to the end. It feels like a place cut off from the world. When you arrive there, it\u0026rsquo;s as if all connection to the outside world ceases to exist.\nIt\u0026rsquo;s not actually my ancestral hometown, but I prefer to call it that. 
It\u0026rsquo;s a mining area. Because it\u0026rsquo;s built into the mountainside, the mine has a striking three-dimensional quality — so much so that I\u0026rsquo;m in awe of the predecessors who designed it. I still don\u0026rsquo;t quite know how to describe the administrative level of this place. It\u0026rsquo;s not a village, not a town — more modern than a village but smaller than a town. When I was little I thought this place was huge; now I realize you can walk through the entire mining area in just ten minutes.\nThe whole place relies on coal mining as its economic pillar. It once prospered, but now it has declined significantly. There are still miners who go underground, but in the living quarters, you no longer see young people like me. The mine has an elementary school; when I attended, there were about 70 students per grade. Now there are only seven.\nThe childhood memories there are overwhelmingly strong — like a paradise, a sanctuary untouched by worldly strife, another world. Being far from modern society, you only need the basics to get by, and time seems to pass slowly. A place like this is indeed very suitable for retirement — and indeed, there are many elderly people here.\nFood # When I was little, the market was fairly lively. I remember the poultry vendor would submerge whole chickens in something black and tar-like before plucking them — the poultry area was always filthy. Now the market no longer sells fresh meat; you can only buy vegetables grown by nearby farmers. If you want fresh meat, you have to go to the village market day or take a bus into the city.\nBecause the old couple is extremely frugal, I was initially worried they lived too simply — maybe just rice and vegetables every day. When I went back this time, I didn\u0026rsquo;t tell them exactly when I\u0026rsquo;d arrive. When I got home, I found they had even bought braised duck — I was quite relieved. 
My return made them very happy, and with just the three of us, they made five or six dishes every day. I even started to wonder if I was there to keep them company or to cause them trouble.\nMaybe I\u0026rsquo;ve been spoiled by the rich flavors of the outside world. At first, when they asked, \u0026ldquo;Is this dish good?\u0026rdquo; I couldn\u0026rsquo;t bring myself to say what I really thought. At moments like this, I recall a line from some book: \u0026ldquo;Humans cannot directly judge the value of something; only by comparing it to something else do they know its worth.\u0026rdquo; The same goes for food. When you taste something for the first time, you don\u0026rsquo;t actually know if it\u0026rsquo;s good or not. If you do know, it must be because you\u0026rsquo;ve already compared it to something in your memory. When I was little and first tried hotpot, adults would always ask, \u0026ldquo;Is this hotpot good?\u0026rdquo; To be honest, I had no idea — I didn\u0026rsquo;t even know what \u0026ldquo;good\u0026rdquo; was supposed to taste like. I just ate.\nNow my palate has indeed grown more demanding, but here, I wanted to reset everything, to press that \u0026ldquo;restore factory settings\u0026rdquo; button. I can say with complete sincerity: what they cook is delicious.\nOne more thing: at one point I offered to wash the dishes. They said, \u0026ldquo;Put them down, you don\u0026rsquo;t know how — we wash dishes with rice water. You wouldn\u0026rsquo;t get them clean. Dish soap is full of chemicals; we don\u0026rsquo;t use that stuff.\u0026rdquo;\nTraditional Chinese Medicine # On this trip, I discovered a fact: elderly people are extremely dependent on medication. Their medicine cabinets are always stuffed with all kinds of drugs — Western medicine, Chinese medicine, cold medicine, anti-inflammatory drugs, ointments, supplements — a whole pile. Whenever they feel something wrong with their body, they reach for whatever they think will help. 
During my visit, my rhinitis flared up (an old problem of mine) — nonstop sneezing and runny nose. They kept urging me to take cold medicine, recommending Ganmaoling or cephalosporin. I must have said at least ten times: \u0026ldquo;It\u0026rsquo;s rhinitis, not a cold.\u0026rdquo; They, of course, had no idea how to treat this kind of rhinitis, so they just kept urging me to take cold medicine.\nOne day I took them into the city. Besides a supermarket run, the more important errand was buying medicine. Buying medicine meant both Chinese and Western.\nThe Chinese medicine was purchased at a Yunnan herbal shop. The shop owner had a buzz cut, a black T-shirt, a silver necklace, and a brown beaded bracelet — he looked quite burly. With his tough-guy appearance, I didn\u0026rsquo;t even dare to speak loudly to him, though my grandfather didn\u0026rsquo;t seem to notice any of that. The shop was mostly filled with herbs I couldn\u0026rsquo;t name, sold by weight, quite expensive — not your typical Chinese medicine. Clearly, my grandfather was a regular customer; the owner knew him. But it seemed my grandfather didn\u0026rsquo;t really know how to pick herbs either: \u0026ldquo;Boss, just weigh me 300 yuan\u0026rsquo;s worth based on my health condition.\u0026rdquo; So the owner grabbed a bit from here, a bit from there, and finally ground everything into powder.\nMaybe I\u0026rsquo;ve studied too much — I\u0026rsquo;ve always been skeptical of traditional Chinese medicine, simply because I find it lacks convincing rationale. I was quite worried they\u0026rsquo;d get scammed; these herbal medicine dealers prey specifically on the elderly. But my grandfather said: \u0026ldquo;Before, your grandmother had constant headaches. After taking this medicine, the headaches stopped.\u0026rdquo; So it seemed to work. 
Western medicine is indeed far too unfriendly to the elderly.\nWestern Medicine # After buying the Chinese medicine, we walked a long way to a pharmacy to buy Western medicine. That pharmacy might be one of the few they know.\nThe vast majority of drugs in that pharmacy couldn\u0026rsquo;t be reimbursed. Only in a tiny, shabby room deep inside were a small selection of reimbursable drugs. I looked around and barely recognized any of them — all named with chemical formulas, completely incomprehensible. Only things like Ganmaoling and loquat syrup were familiar. My grandfather fell into the same difficulty choosing. He recognized cephalosporin, but the pharmacy girl said they didn\u0026rsquo;t have it. He got a bit angry and said to her: \u0026ldquo;Don\u0026rsquo;t you have any decent medicine?\u0026rdquo; (\u0026ldquo;Decent medicine\u0026rdquo;? I tried to parse what he meant.) The girl pulled out a red box of nicely packaged health supplements from somewhere. My grandfather couldn\u0026rsquo;t read the tiny text on the box, so he asked me to read it to him and tell him what it treated. I looked at it — the thing claimed to treat everything — so I didn\u0026rsquo;t read it and handed it back to the girl. In the end, they only picked up a few common cold and cough remedies.\nMy grandfather repeatedly told me along the way that he gets 170 yuan of medical insurance reimbursement per year. I could tell he really, really wanted to spend that 170 yuan, to stockpile some medicine at home. That\u0026rsquo;s why he wanted to go to a Western pharmacy, and that\u0026rsquo;s why we walked all the way to this pharmacy that accepts insurance reimbursement.\nBut there was some trouble at checkout. The cashier girl had looked unhappy from the start. She took the medicine and rattled off a bunch of things I didn\u0026rsquo;t understand — and my grandfather clearly didn\u0026rsquo;t either. 
The only thing we caught was: \u0026ldquo;These can\u0026rsquo;t be reimbursed.\u0026rdquo;\nThe girl said: \u0026ldquo;There\u0026rsquo;s a threshold fee of 150 yuan for reimbursement, and you haven\u0026rsquo;t paid the threshold fee yet.\u0026rdquo;\nMy grandfather said: \u0026ldquo;Is the threshold fee like the 150-yuan bed fee hospitals used to charge?\u0026rdquo;\nThe girl paused, then said impatiently: \u0026ldquo;Yes, yes, whatever you say is right.\u0026rdquo;\nMy grandfather got a bit angry: \u0026ldquo;Forget it, I don\u0026rsquo;t want them!\u0026rdquo;\nI quickly asked the girl what exactly this threshold fee meant. Without a word, she pointed to a notice posted on the window — a table explaining the threshold fee. I couldn\u0026rsquo;t quite make sense of it either, but I understood that this threshold fee had to be paid. I thought about it — when I see a doctor, I just swipe my insurance card directly. What\u0026rsquo;s all this about reimbursement? I was even more confused.\nI said: \u0026ldquo;Can I use my insurance card?\u0026rdquo;\nThe girl said: \u0026ldquo;Out-of-region cards won\u0026rsquo;t work.\u0026rdquo;\nI said: \u0026ldquo;Can I just pay with Alipay?\u0026rdquo;\nSeeing me about to pay, my grandfather immediately stopped me: \u0026ldquo;There\u0026rsquo;s absolutely no way I\u0026rsquo;m letting you pay for this.\u0026rdquo; He pulled cash from his bag and paid. I understood — for me, a hundred-something yuan is nothing, but for them, it\u0026rsquo;s still money they\u0026rsquo;re reluctant to part with.\nTechnology # We marvel at how fast technology advances, always bringing new things that change our way of life and make it more convenient. Working people chase technology and immerse themselves in it. But for the elderly, technology is an entirely different story.\nIn every aspect of elderly people\u0026rsquo;s lives, things involving technology are exceedingly rare. The most commonly used thing is a phone — a smartphone. 
They seem to have adapted well to the fast-paced entertainment of apps like Douyin (TikTok), and they also play with their phones watching short videos before bed. (What they use is probably not actual Douyin, but some other app with recommended short videos.)\nBut that\u0026rsquo;s about the limit. They don\u0026rsquo;t really understand how phones work. For example, when they make phone calls — whether it\u0026rsquo;s my grandmother or grandfather — neither of them hangs up after finishing a call. It\u0026rsquo;s not that they don\u0026rsquo;t want to; they just don\u0026rsquo;t know where to find the hang-up button. If after a call they look at the phone and see a red hang-up button, they\u0026rsquo;ll press it. But if the screen is locked or the screen has changed, they won\u0026rsquo;t know how. My grandmother said to me: \u0026ldquo;Take a look at my phone — after I hang up, why does it keep making noise, keep making noise~\u0026rdquo; In fact, the call hadn\u0026rsquo;t ended at all; the screen had just gone dark and she thought it was hung up. If they call someone else, it\u0026rsquo;s fine, but if they call each other, it could be a disaster — because no one hangs up.\nAnd WeChat messages — they have absolutely no grasp of how WeChat messages work. They don\u0026rsquo;t know how to find someone\u0026rsquo;s chat window, don\u0026rsquo;t know who sent them a message, don\u0026rsquo;t know where messages go. Later, when we went traveling and I took photos for them, they asked me to put the photos on their phones (meaning in their photo albums). I had to operate both of their phones one by one to download photos from WeChat, making sure the downloaded photos were immediately visible in the album — otherwise, they\u0026rsquo;d never find them.\nTraveling # Taking the old couple out to travel was an important mission of this trip. 
I hadn\u0026rsquo;t originally planned it — I just wanted to empty my mind, breathe some fresh air, and stay there experiencing the slow passage of time. But they really enjoy going out. As soon as I arrived, my grandfather proactively suggested I could drive them somewhere for fun.\nWe visited Zhu De\u0026rsquo;s Former Residence, Langzhong Ancient City, and Nanchong — two days and one night. Traveling with elderly people requires more consideration — they can\u0026rsquo;t sit in a car too long or walk too much. So we couldn\u0026rsquo;t really do that many things. But their philosophy of travel is different from ours; they lean more toward \u0026ldquo;checking in,\u0026rdquo; valuing the fact that they\u0026rsquo;ve \u0026ldquo;been here.\u0026rdquo; So they absolutely must take photos at landmark spots with the place name written on them~\nThey also prefer crowded places over scenic spots with few people. In Langzhong, they clearly enjoyed being inside the ancient city — the bustling, noisy, lively atmosphere. They even video-called my aunt and shouted, \u0026ldquo;We\u0026rsquo;re in Langzhong!!\u0026rdquo; (with heavy emphasis), grinning ear to ear. Meanwhile, at White Pagoda Hill (you can drive up, very elderly-friendly), overlooking the panoramic view of Langzhong, I was immersed in a \u0026ldquo;what a view\u0026rdquo; moment. My grandmother looked for two minutes, took two photos, and that was it. I said, \u0026ldquo;Look at the scenery, it\u0026rsquo;s so beautiful — we came all the way up here.\u0026rdquo; She replied, \u0026ldquo;I already looked.\u0026rdquo;\nThe Shed # Usually, my grandfather goes to the small park to watch others play cards, and my grandmother plays mahjong — no money involved.\nBesides cards, the place they spend the most time is the shed downstairs. A few discarded stools and chairs from various families are gathered under the shed, with a stove in the middle where they burn firewood in winter. 
Everyone upstairs and downstairs knows each other — all grandparent-aged, on very good terms with my grandparents. Neighbors will sit together and chat whenever they\u0026rsquo;re free. This is the most important social venue for the \u0026ldquo;neighborhood\u0026rdquo; (it\u0026rsquo;s not really a neighborhood, just two buildings).\nOne evening, I sat in the shed listening to them talk. One grandmother said: \u0026ldquo;Your grandson is so good, taking leave to come back and keep the elderly company. We all say your grandson is wonderful.\u0026rdquo; I was a bit embarrassed, but thought — let this evaluation stay in the minds of these elders.\nOne grandfather said: \u0026ldquo;I told xx\u0026rsquo;s family: come back once a month, spend time with the elderly, no need to give them money. What would they use that money for? They can get by just fine. But you\u0026rsquo;ve grown up and left, and without company for a long time, they feel lonely.\u0026rdquo; I thought, this old man really understands things. He added: \u0026ldquo;Once, so-and-so died. His whole family came back for the funeral. They brought him fruit and food — what\u0026rsquo;s the use? Did he get to eat any of it? To put it bluntly, that was all for show — for us to see. Once a person is gone, none of it matters.\u0026rdquo; Wow. This old man truly gets it.\nComing back once a month is extremely difficult for working people — it\u0026rsquo;s just not realistic. Next year I won\u0026rsquo;t even have this childcare leave anymore. When will I come back next time? I can\u0026rsquo;t think of an answer. As we pass day after day in relentless busyness at work, how do the elderly pass their days — day after day of idleness and loneliness?\nRandom Thoughts # How should the elderly face death? Every time they mention death, it\u0026rsquo;s always with a joking tone, but more than that, there\u0026rsquo;s resignation. How should a person face death? 
When I\u0026rsquo;m old and the deadline is approaching, how will I face it?\nWhile chatting in the shed, I couldn\u0026rsquo;t name any of these grandparent-aged people, but they all remembered me, knew how I grew up. My life seems to be a part of their lives, proof of my existence — even if this memory only lasts for a time. Yet that still has meaning, doesn\u0026rsquo;t it? The bonds of life exist in this way. There are billions of people in this world, and the vast, vast majority are fleeting meteors — remembered by no one, mentioned in no record.\nThis society is remarkably unfriendly to the elderly. Social rules are too complex; they struggle to understand phones, healthcare, and insurance systems, so they can only huddle within their own social circles and flee from this incomprehensible society. At the same time, society has developed rapidly in recent years — children have mostly moved away for their own families and careers. For the elderly, they\u0026rsquo;re happy to see their children thriving, but the distance is vast, and mutual companionship is hard to come by. While society focuses on childcare and increasing birth benefits, no one pays attention to the issue of eldercare and companionship. I doubt there will ever be such a thing as \u0026ldquo;eldercare leave.\u0026rdquo;\nMy grandmother has poor hearing. Even with a hearing aid, it\u0026rsquo;s only slightly better. Often when I talk to her, she doesn\u0026rsquo;t follow at all and answers about something else entirely. But I can\u0026rsquo;t bring myself to raise my voice — it feels so rude. Leaning in close to speak makes her self-conscious. I suggested they come live with us in Chengdu, but she wouldn\u0026rsquo;t agree under any circumstances. I think maybe it\u0026rsquo;s because her hearing loss makes her afraid of communicating with people, timid in social situations. Only there, in the mining community, do the neighbors treat her well — it gives her a sense of security. 
One elderly woman said: \u0026ldquo;Being hard of hearing is good — it adds years to your life.\u0026rdquo;\nWritten — April 2023\nThe term \u0026ldquo;Popo\u0026rdquo; (a Chinese term for grandmother) still exists in my generation, but my children no longer say \u0026ldquo;Popo\u0026rdquo; — they say \u0026ldquo;Nainai\u0026rdquo; instead. Perhaps \u0026ldquo;Popo\u0026rdquo; is the last time this term will be used in our family line — may be the last call. Let it be preserved in this essay.\n","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/2024/08/12/people-from-another-world/","section":"Posts","summary":"​ Vacation # I took a long vacation and went back to my hometown before my leave days expired — not just to escape the busyness of work, but also to visit my grandparents. For working people like us, going back to our hometown is really difficult. If it’s just a weekend trip, we’d only get one day of rest before having to head back — too exhausting. We don’t get many vacation days to begin with, and when we do, most people think about driving out to see some scenery or just staying home for a few days doing nothing. No one usually thinks of using their precious leave to visit elderly relatives back home.\n","title":"People from Another World","type":"posts"},{"content":" Problem Description # PostgreSQL DELETE was failing with attempted to delete invisible tuple, but SELECT with the same conditions worked fine.\ndelete from lzltab1; select count(*) from lzltab1; Results of full-table delete and full-table select:\nM=# delete from lzltab1; ERROR: 55000: attempted to delete invisible tuple LOCATION: heap_delete, heapam.c:2500 Time: 511.050 ms M=# select count(*) from lzltab1; count -------- 231187 DELETE found an invisible tuple, but SELECT was fine.\nThis seemed very strange at first. PG visibility is determined by the tuple\u0026rsquo;s xmin, xmax, cid and the snapshot\u0026rsquo;s xmin, xmax, xip_list. 
Although the transaction state and timing of the tuple deletion can affect visibility, if the table data is stable (no ongoing DML), any subsequent snapshot should yield a stable visibility set. There shouldn\u0026rsquo;t be a case where the current transaction\u0026rsquo;s visibility differs from others — DML transaction tuple visibility should be consistent. In other words, in this scenario, the SELECT snapshot and DELETE snapshot shouldn\u0026rsquo;t produce different results.\nAnalysis # Finding the Source Code # Note the error location: heapam.c:2500\nFind the source at src/backend/access/heap/heapam.c.\nLine 2500 is blank; nearby code is:\n/* * Before locking the buffer, pin the visibility map page if it appears to * be necessary. Since we haven\u0026#39;t got the lock yet, someone else might be * in the middle of changing this, so we\u0026#39;ll need to recheck after we have * the lock. */ if (PageIsAllVisible(page)) visibilitymap_pin(relation, block, \u0026amp;vmbuffer); LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); From the source, it\u0026rsquo;s trying to acquire a lock on the VM, so the problem appears related to the VM file.\nThe VM File # What is the VM file?\nThe VM (Visibility Map) file exists to reduce the time vacuum spends scanning pages. If a page doesn\u0026rsquo;t need vacuuming, it can be skipped, greatly reducing the time spent finding pages that need cleaning. This is the original purpose of the VM file. (It\u0026rsquo;s also sometimes used by index-only scans, but that doesn\u0026rsquo;t apply here since we\u0026rsquo;re doing a sequential scan.)\nThe VM file stores two pieces of information:\nWhether all tuples on a page are visible. This means the page has no dead tuples needing vacuum. Whether all tuples on a page are frozen. This means vacuum freeze doesn\u0026rsquo;t need to visit this page. The VM helps vacuum find dead tuples while reducing the number of pages scanned. 
For example, in the diagram above (interdb ftw!), the first page contains no dead tuples, so vacuum can skip it.\nFinding the VM File\nEvery table has a Visibility Map (VM) file (indexes don\u0026rsquo;t have VM files), stored alongside the table file. If a table\u0026rsquo;s filenode is 12345, its VM file is 12345_vm.\nFirst, cd to the data directory:\nM=# show data_directory; data_directory ---------------------- /pg/pg6666/data Find the file storage location using the database OID and table OID:\n=# select oid,datname from pg_database where datname=\u0026#39;sdp\u0026#39;; -------+---------------------- oid | datname 17075 | sdp =# select oid,relname from pg_class where relname=\u0026#39;lzltab1\u0026#39;; -------+---------------------- 17362 | lzltab1 Or:\n# select pg_relation_filepath(\u0026#39;lzltab1\u0026#39;); pg_relation_filepath ---------------------- base/17075/17362 Find the data file and VM:\n$ cd /pg/pg6666/data/base/17075 $ ll 17362* -rw------- 1 postgres postgres 86761472 Jun 15 17:43 17362 -rw------- 1 postgres postgres 40960 Jun 9 21:09 17362_fsm -rw------- 1 postgres postgres 8192 Nov 14 2022 17362_vm The pg_visibility Extension # pg_visibility provides page-level visibility information by inspecting VM files, and can detect VM corruption. Since the VM stores \u0026ldquo;are all tuples on this page visible; are all tuples on this page frozen\u0026rdquo; information, pg_visibility can identify which pages are all-frozen and which are all-visible.\npg_visibility extension reference: https://www.postgresql.org/docs/current/pgvisibility.html\nUseful pg_visibility Functions # pg_visibility_map_summary(): Shows the count of all-visible and all-frozen pages in the VM.\npg_check_frozen(): Returns rows where a tuple is not frozen but its page is marked all-frozen in the VM. If this function returns results, the VM file is corrupt.\npg_check_visible(): Returns rows where a tuple is not visible but its page is marked all-visible in the VM. 
If this function returns results, the VM file is corrupt.\npg_truncate_visibility_map(): Clears the VM file. After clearing, the next vacuum on the table will scan all pages and rebuild the VM.\nRepairing the VM File # Check for VM corruption:\nM=# select pg_visibility_map_summary(\u0026#39;lzltab1\u0026#39;); pg_visibility_map_summary --------------------------- (472,0) 472 all-visible pages, 0 all-frozen pages.\nM=# select pg_check_frozen(\u0026#39;lzltab1\u0026#39;); pg_check_frozen ----------------- (0 rows) M=# select pg_check_visible(\u0026#39;lzltab1\u0026#39;); pg_check_visible ------------------ (6839,1) (6839,2) ... (7296,15) (1423 rows) pg_check_visible() returning results means the VM is corrupted.\nNow use pg_truncate_visibility_map() to clear the VM:\nM=# select pg_truncate_visibility_map(\u0026#39;lzltab1\u0026#39;); pg_truncate_visibility_map ---------------------------- On disk, you can see the VM was cleared:\nll 17362* -rw------- 1 postgres postgres 86761472 Jun 27 10:39 17362 -rw------- 1 postgres postgres 40960 Jun 9 21:09 17362_fsm -rw------- 1 postgres postgres 0 Jun 27 18:18 17362_vm Now verify by vacuuming the table to regenerate the VM file and check it\u0026rsquo;s not corrupted:\nM=# vacuum lzltab1; VACUUM Time: 3692.402 ms (00:03.692) M=# \\q $ ll 17362* -rw------- 1 postgres postgres 86761472 Jun 28 03:37 17362 -rw------- 1 postgres postgres 40960 Jun 9 21:09 17362_fsm -rw------- 1 postgres postgres 8192 Jun 28 10:21 17362_vm After manual vacuum, the VM was regenerated correctly:\nM=# select pg_check_visible(\u0026#39;lzltab1\u0026#39;); pg_check_visible ------------------ (0 rows) M=# select pg_check_frozen(\u0026#39;lzltab1\u0026#39;); pg_check_frozen ----------------- (0 rows) Both checks return empty — VM file is healthy. Repair complete.\nFinally, re-run the SQL:\n## delete from lzltab1; DELETE 229766 DELETE executes normally. 
Problem resolved.\nChecking the Entire Database for VM Corruption # Although we fixed one corrupted VM file, we should check the entire database for other VM corruption (requires the pg_visibility extension installed):\nSELECT oid::regclass AS relname FROM pg_class WHERE relkind IN (\u0026#39;r\u0026#39;, \u0026#39;m\u0026#39;, \u0026#39;t\u0026#39;) AND ( EXISTS (SELECT * FROM pg_check_visible(oid)) OR EXISTS (SELECT * FROM pg_check_frozen(oid))); If results are returned, there\u0026rsquo;s VM corruption. Use pg_truncate_visibility_map() to clear the VM, then vacuum to regenerate it, as shown above.\nFor versions before 9.6 (which lack the pg_visibility extension), you\u0026rsquo;d need to stop the database, manually delete the VM files, restart, then vacuum to regenerate them.\nWhy Does VM Corruption Happen? # We traced the issue step by step to VM file corruption, but why did it corrupt?\nPostgreSQL bugs. PG has had some bugs causing VM corruption (see Visibility Map Problems wiki), but these were all before PG 9.6.1. Operating system or hardware issues. Our version was PG13, so the cause can only be broadly attributed to OS or hardware problems.\nWhy Did SELECT Succeed But DELETE Fail? # A full-table SELECT working while a full-table DELETE errors out seems bizarre. The root cause is VM file corruption.\nAs mentioned, the VM file exists to speed up vacuum. Even though we weren\u0026rsquo;t running vacuum, the VM file still needs to be updated — DML operations always update (or at least check) the VM, while SELECT does not change VM state. So in this case, SELECT executed normally, but DELETE errored during VM processing.\nIn our case, DELETE scanned the VM and found pages marked all-visible, but the VM was wrong — those pages still contained invisible tuples. This is exactly the attempted to delete invisible tuple error. 
Invisible tuples may have already been deleted, and trying to delete them again naturally errors out, violating transaction visibility rules.\nAdditionally, index-only scans also use the VM file, so they would also be affected. However, this case involved a sequential scan, so SELECT was unaffected.\nVM Corruption Causing Incorrect Index-Only Scan Results # As mentioned earlier, besides vacuum, index-only scans also use the VM file. Even though our case didn\u0026rsquo;t involve index-only scans, let\u0026rsquo;s dig deeper for completeness.\nWhat Is an Index-Only Scan? # As the name suggests, an index-only scan accesses only the index structure to get results, without touching the table. Almost all relational databases support index-only scans because B+tree index structures store key values — if the query only needs key values, an index-only scan is possible.\nHowever, PostgreSQL\u0026rsquo;s transaction implementation differs significantly from other databases (Oracle, MySQL), giving its index-only scans some unique characteristics.\nPostgreSQL checks tuple visibility via xmin, xmax, and other information in tuple headers, but indexes don\u0026rsquo;t contain this information. This means PG\u0026rsquo;s index-only scans must visit data blocks to check visibility. This is where the VM comes in: since the VM stores all-visible and all-frozen information, pages marked as such don\u0026rsquo;t need visibility checks — the VM has already confirmed their visibility.\nAnother interdb diagram (interdb ftw!). When a query looks up tuples with keys 18 and 19: the page containing key=18 is marked all-visible in the VM, so accessing this tuple only requires the index page and VM file. 
The page containing key=19 is not marked all-visible, so the index-only scan still needs to visit the data page to check visibility.\nIndex-Only Scan Returning Incorrect Results # Because index-only scans consult the VM, and a corrupted VM stores wrong information — e.g., a page\u0026rsquo;s tuples aren\u0026rsquo;t all visible (some may have been deleted), but the page is still marked all-visible — the index-only scan skips the data page visibility check and directly returns index key values that should be invisible.\nYou can set enable_indexonlyscan=off to disable index-only scans and guarantee correct results. Or, as shown above, repair the VM file — which is probably the better choice.\nSummary # The journey had some twists: at first glance the error seemed like a transaction visibility rule problem, which would have been serious — but it was actually much simpler.\nWe traced the attempted to delete invisible tuple error to the source code, identified it as a VM issue, used the pg_visibility extension to detect and fix the VM corruption, resolved the DELETE error, and finally explored the relationship between index-only scans and the VM.\nKey takeaways:\nThe pg_visibility extension can read, check, and clear VM files Without VM information, vacuum will generate a new VM DML reads/updates VM files; SELECT does not (non-index-only-scan) The VM file exists to improve vacuum efficiency, and sometimes index-only scan efficiency The attempted to delete invisible tuple error warrants checking the VM file for corruption VM file corruption can cause DML failures and incorrect index-only scan results References # https://www.postgresql.org/docs/13/pgvisibility.html\nhttps://wiki.postgresql.org/wiki/Visibility_Map_Problems\nhttps://www.interdb.jp/pg/pgsql06.html\nhttps://www.interdb.jp/pg/pgsql07.html\n","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/2024/08/12/pg-error-attempted-to-delete-invisible-tuple/","section":"Posts","summary":"Problem Description # 
PostgreSQL DELETE was failing with attempted to delete invisible tuple, but SELECT with the same conditions worked fine.\ndelete from lzltab1; select count(*) from lzltab1; Results of full-table delete and full-table select:\nM=# delete from lzltab1; ERROR: 55000: attempted to delete invisible tuple LOCATION: heap_delete, heapam.c:2500 Time: 511.050 ms M=# select count(*) from lzltab1; count -------- 231187 DELETE found an invisible tuple, but SELECT was fine.\nThis seemed very strange at first. PG visibility is determined by the tuple’s xmin, xmax, cid and the snapshot’s xmin, xmax, xip_list. Although the transaction state and timing of the tuple deletion can affect visibility, if the table data is stable (no ongoing DML), any subsequent snapshot should yield a stable visibility set. There shouldn’t be a case where the current transaction’s visibility differs from others — DML transaction tuple visibility should be consistent. In other words, in this scenario, the SELECT snapshot and DELETE snapshot shouldn’t produce different results.\n","title":"PG Error: attempted to delete invisible tuple","type":"posts"},{"content":"Interview questions source: PostgreSQL Apprentice PostgreSQL Interview Questions Collection\nExisting answers: Hehuyi_In Learning and Answering PostgreSQL Interview Questions\n1. MVCC Implementation and Differences from Oracle # ORACLE and MYSQL both use UNDO to implement multi-version concurrency control. Undo entries are recorded in additional undo tablespaces. If the UNDO segment is insufficient, an ora-01555 error occurs. https://www.slideshare.net/AmitBhalla2/less10-undo-15946188\nPostgreSQL has no undo mechanism. To ensure transaction rollback, old tuples remain on the table. For example, an update inserts a new row while the old data stays in place. Tuple headers, clog, etc. determine which tuple version is valid. 
Visibility information in tuple headers includes xmin, xmax, cmin, cmax, infomask, and infomask2, stored in the tuple header.\nhttps://www.interdb.jp/pg/pgsql05/03.html\nPros/cons: The undo approach requires extra undo space; space management is simpler. However, large transaction rollback is very troublesome since undo segments must be rolled back. The new-tuple approach makes large transaction rollback very fast, but this method creates dead tuples, requiring a vacuum mechanism to clean them. Vacuum freeze itself isn\u0026rsquo;t directly related to dead tuple cleanup (though both are vacuum processes); freeze prevents transaction ID wraparound.\n2. Why Table Bloat Occurs and Its Hazards # Why table bloat?\nAs above, due to PostgreSQL\u0026rsquo;s unique MVCC mechanism, delete doesn\u0026rsquo;t truly remove tuples, and update equals delete+insert. Old tuples cannot be removed by DML statements, so space only \u0026ldquo;grows\u0026rdquo; without \u0026ldquo;cleaning\u0026rdquo; — this is table bloat. 
Vacuum is generally needed to clean dead tuples and mark space as available; or vacuum full rewrites the table for compaction.\nHazards of table bloat:\nExcessive table space usage SQL performance degradation Large tables cause longer vacuum cleanup times; vacuum full blocking time also increases, though pg_repack can replace vacuum full to reduce blocking Handling table bloat:\nManual vacuum Does not block queries or DML operations Does not immediately reclaim space, only marks it as available If the last page of a table has no tuples, that page gets truncated (https://www.interdb.jp/pg/pgsql06.html)\nAutovacuum Autovacuum automatically invokes vacuum for concurrent cleanup as needed Manual vacuum full 8-level lock, blocks everything Table is completely rewritten; corresponding OS files are cleaned and rebuilt Rebuilds indexes, FSM (free space map), VM (visibility map) pg_repack and other manual table rebuilds pg_repack only has a brief lock during the final table switch Other tools with data sync and switch capabilities Avoiding table bloat:\nGenerally, autovacuum handles table bloat, but cleanup may not proceed smoothly in some scenarios:\nAutovacuum worker isn\u0026rsquo;t running Both autovacuum and track_counts must be enabled for autovacuum to work autovacuum_max_workers must be set high enough; multiple workers may be needed simultaneously Table hasn\u0026rsquo;t reached vacuum threshold — rows deleted/updated: threshold = autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * tuples autovacuum_vacuum_insert_threshold and autovacuum_vacuum_insert_scale_factor represent insert thresholds (same algorithm). Insert-triggered vacuum thresholds theoretically have little to do with bloat cleanup since inserts don\u0026rsquo;t generate dead tuples. However, to prevent wraparound issues from not being handled in time, pg13 added this parameter (reference: postgresql-autovacuum-insert-only-tables) autovacuum_naptime is the autovacuum launcher cycle. 
If set too large, autovacuum_max_workers may be sufficient and tables may meet thresholds, but the launcher hasn\u0026rsquo;t woken workers vacuum_defer_cleanup_age delays vacuum cleanup by N transactions (originally designed to alleviate standby query conflicts; since hot_standby_feedback and replication slots exist, pg16 removed this parameter) Disable or adjust cost-based vacuuming to make autovacuum faster Cost-based vacuuming may be enabled to reduce vacuum\u0026rsquo;s IO impact. When vacuum/autovacuum reaches the cost limit, it sleeps for autovacuum_vacuum_cost_delay (or vacuum_cost_delay) milliseconds. vacuum_cost_delay defaults to 0 (disabling cost-based vacuuming); autovacuum_vacuum_cost_delay at -1 means using the vacuum_cost_delay setting. Disable delay or reduce the delay value If cost-based vacuuming is enabled, reasonably increase vacuum_cost_limit trigger threshold and reduce the vacuum_cost_page_dirty, vacuum_cost_page_miss, vacuum_cost_page_hit values that count toward the limit Active transactions preventing vacuum Business long transactions not finished. Application-side transactions shouldn\u0026rsquo;t run too long; database-side can kill sessions: 1) manual kill 2) set idle_in_transaction_session_timeout to limit idle time 3) set old_snapshot_threshold to limit SQL execution (not recommended before PG14) Unclosed cursors hot_standby_feedback enabled: primary records catalog_xmin, standby long queries prevent primary cleanup Remove unused replication slots Orphan transactions. Prepared transactions are explicit 2PC transactions inside PG. If a prepared transaction is opened but not completed, and prepared transactions are unrelated to sessions, orphan transactions block indefinitely pg_dump logical backup opens implicit repeatable read isolation level; transaction not finished Performance aspects maintenance_work_mem is memory for maintenance operations like vacuum; default 64MB can be increased. 
Or use autovacuum_work_mem separately for autovacuum workers; default -1 means using maintenance_work_mem Large table vacuum is especially slow; since vacuum can\u0026rsquo;t parallelize on the same table, convert large tables to partitioned tables so vacuum can run in parallel across partitions Good IO system Adjust per-table autovacuum parameters Global autovacuum settings may not suit certain business tables; adjust per-table autovacuum parameters to increase vacuum trigger probability Manual vacuum Autovacuum is generally unpredictable; for special business tables, manual vacuum Run manual vacuum during low-traffic periods, optionally with freeze and analyze The above handles 99.99% of table bloat problems. One type of bloat is harder to address: with cost-based vacuuming disabled, autovacuum dead tuple cleanup speed cannot keep up with generation speed. Essentially, too many concurrent update (or insert+delete) transactions mean this round of vacuum hasn\u0026rsquo;t finished cleaning available space before massive updates generate new space and dead tuples, causing continuous bloat. Solutions:\nConvert to partitioned tables for vacuum parallelism (only meaningful if updates are distributed across partitions) Run vacuum full or pg_repack during off-peak hours to thoroughly clean table holes 24/7 high-concurrency tables are unlikely; if they exist, restructure to multi-table writes or move to caching systems like Redis Unveiling the Mystery of Table Bloat\nhttps://www.interdb.jp/pg/pgsql06.html\nhttps://www.postgresql.org/docs/16/routine-vacuuming.html\nhttps://www.postgresql.org/docs/16/runtime-config-autovacuum.html\nhttps://www.postgresql.org/docs/16/runtime-config-resource.html#GUC-VACUUM-COST-DELAY\n3. Long Transaction Hazards and How to Trace Them # Regular queries don\u0026rsquo;t generate transaction IDs but virtual transaction IDs (vxid). Virtual transaction IDs consist of backendID and a backend-local counter, unrelated to transaction ID (XID). 
However, although queries don\u0026rsquo;t generate transaction IDs, they hold snapshots for visibility checks. Snapshots contain xmin, xmax, the list of in-progress transactions (xip), and other information.\n(https://www.interdb.jp/pg/pgsql05/05.html)\nSo long transaction issues involve both DML and query statements, though their lock types differ.\nLong transaction hazards:\nBlocks vacuum cleanup, causing table bloat, excessive space usage, and SQL performance degradation Blocks other lock requests; e.g., DDL must check for long transactions before execution, otherwise a long wait for a higher-level lock causes a lock queue pile-up that blocks later statements Long transactions cause create index concurrently to fail, leaving invalid indexes Occupies connection pool (though mainly a long-connection issue) Logical decoding data spilling to disk causing replication lag, also related to large transactions A long transaction with a savepoint subtransaction can cause query performance cliffs (reference: Why we spent the last month eliminating PostgreSQL subtransactions) How to trace long transactions:\npg_stat_activity: check xact_start for transaction start time, and state and state_change for whether the transaction is still running 4. Subtransaction Hazards and Considerations # Subtransaction hazards:\nExcessive transaction ID consumption, premature wraparound handling. Each subtransaction consumes one XID PGPROC_MAX_CACHED_SUBXIDS overflow causing performance degradation. Each backend has a subtransaction cache of PGPROC_MAX_CACHED_SUBXIDS, fixed at 64 subtransactions (hardcoded). 
Exceeding 64 subtransactions spills to the pg_subtrans directory (reference: PostgreSQL Subtransactions Considered Harmful) Using subtransactions with FOR UPDATE explicit row locks causes dramatic database performance degradation (reference: Notes on some PostgreSQL implementation details) A long transaction with a savepoint subtransaction can also cause query performance cliffs (reference: Why we spent the last month eliminating PostgreSQL subtransactions) Usage recommendations:\nSubtransaction usage is discouraged given the above hazards If standby query workloads exist, prohibit subtransactions If subtransactions are still needed, keep them under 64 (preferably much lower) Besides explicit savepoints, subtransactions can also arise from exceptions, frameworks, and tools PG Transactions: Subtransactions\n5. Which Schema Changes Are Non-Online # Strictly speaking, most schema changes are non-online because most ALTER TABLE operations require an ACCESS EXCLUSIVE (8-level) lock. However, some schema changes themselves take a long time or cause slow queries afterward. So this question can be reframed as three sub-questions:\nImpact on indexes? Impact on statistics? Does it require rewriting the table, causing long-held 8-level locks?\nSchema Change Summary Chart\nSummary:\nDropping a column completes immediately, but watch for composite index and multi-column statistics invalidation to avoid SQL performance avalanches Adding a column with a default value: 1) pg10 and earlier requires table rewrite 2) pg11+: only volatile function defaults require table rewrite. 
Also, statistics won\u0026rsquo;t be immediately available for the new column Changing column length: enlarging (except int to bigint) doesn\u0026rsquo;t rewrite the table; shrinking requires table rewrite; column statistics invalidated Changing column type: table rewrite; statistics invalidated Adding constraints to existing columns scans the table, watch for scan duration (e.g., ADD CONSTRAINT, SET NOT NULL) Adding defaults to existing columns completes immediately (e.g., SET/DROP DEFAULT) SET { LOGGED | UNLOGGED } rewrites the table Storage parameter changes depend on what\u0026rsquo;s changing. E.g., fillfactor and autovacuum parameters are online, non-8-level-lock, immediate (reference: Storage Parameters) 6. Physical Backup Considerations (pg_start_backup) # (https://postgrespro.com/media/2022/03/24/pgpro-backup-methods%20(1).pdf)\nPG physical backup:\nBlock-level backup, generally doesn\u0026rsquo;t support per-database backup (except pg_probackup) Exclusive mode is unnecessary because: 1) only works on primary 2) doesn\u0026rsquo;t allow parallel backup 3) created backup label may prevent primary instance recovery 4) functionally identical to non-exclusive backup. 
PG9.6 added non-exclusive mode; PG15 removed exclusive mode If explicitly using pg_start_backup(), must explicitly use pg_stop_backup() to end backup mode (renamed pg_backup_start() and pg_backup_stop() in PG15+) FPI (full page image) is force-enabled during backup, even if full_page_writes is off Backup tools generally call pg_start_backup() before the backup starts, which triggers a checkpoint to flush dirty data, and back up all WAL from start to end, even newly generated WAL during backup, ensuring data consistency and PITR pg_basebackup:\nNative, built-in Wraps pg_start_backup and pg_stop_backup commands PG17+ supports incremental backup and backup set merging Consumes one walsender process pg_probackup:\nVery powerful: supports incremental backup, incremental restore, parallelism, backup set merging, backup verification, remote backup, per-database restore, etc. BUG: address space cannot exceed 4GB, fixable by modifying source code pgBackRest:\nAlso very powerful Prerequisite: SSH must be configured from backup server to database host https://developer.aliyun.com/article/59359\nhttps://www.postgresql.org/docs/current/app-pgbasebackup.html\nhttps://www.enterprisedb.com/blog/exclusive-backup-mode-finally-removed-postgres-15\nhttps://github.com/MasaoFujii/pg_exclusive_backup\nhttps://github.com/postgrespro/pg_probackup\nhttps://pgbackrest.org/user-guide.html\n7. How Logical Backup Ensures Consistency # pg_dump completes a full backup within a single transaction, with isolation level serializable or repeatable read Before backing up data, pg_dump acquires ACCESS SHARE locks on target objects to prevent table drops Additional logical backup considerations:\nWatch for lock conflicts during export If DDL operations are needed, avoid full-database or long-duration backups; split the backup into multiple tasks, e.g., one table per pg_dump invocation https://developer.aliyun.com/article/14582\n8. 
Causes of WAL Accumulation # Invalid replication slots Logical replication with long transactions Excessively large wal_keep_size Excessively small archive_timeout, forcing WAL switches and archiving (equivalent to pg_switch_wal(), named pg_switch_xlog() before PG10, plus archiving) Archive failures generating .ready files Single-process archiving can\u0026rsquo;t keep up FPI full page writes (check for overly frequent checkpoints, UUID-like scattered write patterns) 9. Hazards of Long Connections # When PG acquires snapshot data, it must scan all backend process transaction states. Too many connections degrade performance (recommended max ~1000; pg14 optimized but still not recommended to exceed) relcache/syscache doesn\u0026rsquo;t release cached metadata, and each process caches independently, causing high memory consumption 10. Role of Infomask Flags # Infomask provides transaction, lock, and tuple status information, such as whether a transaction is committed/aborted, row lock info, HOT info, column count, etc. The header has two infomasks: infomask and infomask2. They store different information, with different bits representing different meanings Hint bits also write transaction info to infomask, so visibility can be determined from tuple headers alone without accessing clog PG Transactions: Transaction-Related Tuple Structure\n11. How NULL Values Are Stored and Whether Indexes Store NULLs # How NULL values are stored:\nNULL is stored in the tuple header, not the tuple data area One bit in infomask marks whether the tuple contains NULLs t_bits has n*8 bits (n integer; e.g., a 10-column table has 16-bit t_bits), with a bitmap representing which columns are NULL Whether indexes store NULL values:\nPostgreSQL indexes store NULL values; Oracle B-tree indexes don\u0026rsquo;t index rows whose key columns are all NULL Storage position depends on (NULLS FIRST or NULLS LAST) https://www.highgo.ca/2020/10/20/the-way-to-store-null-value-in-pg-record/\n12. 
Why Full Page Writes Are Needed # The official documentation\u0026rsquo;s introduction to full page writes is fairly general:\nThis is needed because a page write that is in process during an operating system crash might be only partially completed, leading to an on-disk page that contains a mix of old and new data. The row-level change data normally stored in WAL will not be enough to completely restore such a page during post-crash recovery. Storing the full page image guarantees that the page can be correctly restored, but at the price of increasing the amount of data that must be written to WAL. (Because WAL replay always starts from a checkpoint, it is sufficient to do this during the first change of each page after a checkpoint)\nOS file pages are typically 4KB, while PG pages are typically 8KB. Partial writes can occur, where a disk data page contains both old and new data, causing data loss during recovery. Hence the need for full page writes.\nPartial writes are closely related to disk characteristics. Detailed answers are difficult; reference roger\u0026rsquo;s article. Summary:\nPartial writes relate to whether the disk supports atomic writes Partial writes relate to whether OS block size matches database block size. Oracle/PG blocks default to 8KB, MySQL to 16KB, OS to 4KB. A database\u0026rsquo;s minimum IO requires multiple OS calls For PG, if a data page experiences partial write, it can recover using full page images in WAL For MySQL, there\u0026rsquo;s a double write mechanism. The double write buffer is on-disk space, written sequentially before data pages to mitigate partial write For Oracle, much work has been done but no obvious solution exists. However, Oracle supports block-level recovery to replace corrupted data blocks Different DBs adopt different approaches to reduce partial writes. PG writes the entire data page to WAL logs, but this causes WAL write amplification. 
This can be mitigated through various means.\nHow to perfectly solve the partial write problem?\nAtomic write-capable devices OS minimum IO matching database minimum IO http://www.killdb.com/2020/04/05/double_write_partial_write_oracle_mysql_postgresql/\n13. Various Causes of Index Invalidation # Index invalidation:\nCREATE INDEX CONCURRENTLY can leave an invalid index behind after a deadlock or a unique index check failure; invalid indexes are still maintained on writes Invalid indexes on partitioned parent tables indicate some partitions have the index while others don\u0026rsquo;t Index not being used:\nInaccurate statistics Poor selectivity Data skew Plan caching for prepared statements: the first 5 executions use custom plans, after which a cached generic plan may be chosen Leftmost prefix principle Too little data (a hash or full scan is no slower than an index scan) Functions (unless a matching immutable function index exists), implicit conversions, operations, LIKE with leading \u0026lsquo;%\u0026rsquo;\u0026hellip; Data type mismatch Collation mismatch (less of an issue in PG since database collation can\u0026rsquo;t change after creation; data within one database shares the same collation; cross-database access is normally impossible) SQL collation sort differing from index collation sort LIKE only usable with collation C or a pattern operator class index Correlation: low correlation between index logical order and data physical order, so index access touches scattered data LIMIT xx ORDER BY column1, MIN/MAX and similar TOP N scenarios where the optimizer chooses another index 14. Role of Commit Log # The commit log (clog) records transaction status. During the next visibility check on a tuple, hint bits are set, writing the clog transaction status into the tuple header.\nWhy not write transaction status to the tuple header immediately? Updating hint bits immediately at commit would perform very poorly, so transaction status is first placed in clog, reducing PGXACT contention and improving performance.\npg事务：事务相关元组结构\n15. 
Database Join Methods and Their Applicable Scenarios # 1.1 Nested Loop Join\nlzldb=# explain select a from lzl1,t3 where lzl1.col1=t3.a::text; QUERY PLAN ----------------------------------------------------------- Nested Loop (cost=0.00..2.29 rows=10 width=4) Join Filter: ((lzl1.col1)::text = (t3.a)::text) -\u0026gt; Seq Scan on t3 (cost=0.00..1.01 rows=1 width=4) -\u0026gt; Seq Scan on lzl1 (cost=0.00..1.10 rows=10 width=2) The driving table (outer in the diagram, first table in the plan) matches each row against every row of the driven table (inner, second table in the plan). The driving table is scanned once; the driven table is scanned N times (N = driving table rows).\nNL suits almost all scenarios; it\u0026rsquo;s the simplest brute-force join. Generally smaller tables serve as the driving table (actually neither table should be too large, unless other join types don\u0026rsquo;t apply).\n1.2 Materialized Nested Loop Join\ntestdb=# EXPLAIN SELECT * FROM tbl_a AS a, tbl_b AS b WHERE a.id = b.id; QUERY PLAN ----------------------------------------------------------------------- Nested Loop (cost=0.00..750230.50 rows=5000 width=16) Join Filter: (a.id = b.id) -\u0026gt; Seq Scan on tbl_a a (cost=0.00..145.00 rows=10000 width=8) -\u0026gt; Materialize (cost=0.00..98.00 rows=5000 width=8) -\u0026gt; Seq Scan on tbl_b b (cost=0.00..73.00 rows=5000 width=8) If the driven table (inner) needs multiple scans, physical IO each time would be very slow (and seems silly). 
Materialize scans the driven table into memory (work_mem), performing only one physical table scan, allowing the driven table to be accessed multiple times in memory.\nThis scenario is very common in real-world workloads.\n1.3 Indexed Nested Loop Join (inner indexed)\ntestdb=# EXPLAIN SELECT * FROM tbl_c AS c, tbl_b AS b WHERE c.id = b.id; QUERY PLAN -------------------------------------------------------------------------------- Nested Loop (cost=0.29..1935.50 rows=5000 width=16) -\u0026gt; Seq Scan on tbl_b b (cost=0.00..73.00 rows=5000 width=8) -\u0026gt; Index Scan using tbl_c_pkey on tbl_c c (cost=0.29..0.36 rows=1 width=8) Index Cond: (id = b.id) 1.4 NL Variants\nAll are essentially NL; the main variations are whether indexes are used on either table and whether Materialize is applied.\n2.1 Merge Join\ntestdb=# EXPLAIN SELECT * FROM tbl_a AS a, tbl_b AS b WHERE a.id = b.id AND b.id \u0026lt; 1000; QUERY PLAN ------------------------------------------------------------------------- Merge Join (cost=944.71..984.71 rows=1000 width=16) Merge Cond: (a.id = b.id) -\u0026gt; Sort (cost=809.39..834.39 rows=10000 width=8) Sort Key: a.id -\u0026gt; Seq Scan on tbl_a a (cost=0.00..145.00 rows=10000 width=8) -\u0026gt; Sort (cost=135.33..137.83 rows=1000 width=8) Sort Key: b.id -\u0026gt; Seq Scan on tbl_b b (cost=0.00..85.50 rows=1000 width=8) Filter: (id \u0026lt; 1000) (9 rows) In merge join, both the driving and driven tables must be sorted first (both tables have Sort in the plan) before matching. Advantage: fewer table scans and matches than NL. Disadvantage: sorting required.\nSince indexes are sorted, and SQL may include DISTINCT, GROUP BY, SORT, MAX/MIN etc. 
requiring ordering, merge join is also common.\n2.2 Materialized Merge Join\ntestdb=# EXPLAIN SELECT * FROM tbl_a AS a, tbl_b AS b WHERE a.id = b.id; QUERY PLAN ----------------------------------------------------------------------------------- Merge Join (cost=10466.08..10578.58 rows=5000 width=2064) Merge Cond: (a.id = b.id) -\u0026gt; Sort (cost=6708.39..6733.39 rows=10000 width=1032) Sort Key: a.id -\u0026gt; Seq Scan on tbl_a a (cost=0.00..1529.00 rows=10000 width=1032) -\u0026gt; Materialize (cost=3757.69..3782.69 rows=5000 width=1032) -\u0026gt; Sort (cost=3757.69..3770.19 rows=5000 width=1032) Sort Key: b.id -\u0026gt; Seq Scan on tbl_b b (cost=0.00..1193.00 rows=5000 width=1032) (9 rows) Materialize doesn\u0026rsquo;t reduce table scans (both tables scanned once), but the sort operation can happen in the backend\u0026rsquo;s work_mem for better efficiency; if exceeding work_mem, disk sort is used.\n2.3 Merge Join Variants\nSimilar to NL variants, mainly Materialize and index usage. When using indexes, since the index is inherently ordered, no extra sorting is needed:\nQUERY PLAN -------------------------------------------------------------------------------------- Merge Join (cost=135.61..322.11 rows=1000 width=16) Merge Cond: (c.id = b.id) -\u0026gt; Index Scan using tbl_c_pkey on tbl_c c (cost=0.29..318.29 rows=10000 width=8) -\u0026gt; Sort (cost=135.33..137.83 rows=1000 width=8) Sort Key: b.id -\u0026gt; Seq Scan on tbl_b b (cost=0.00..85.50 rows=1000 width=8) Filter: (id \u0026lt; 1000) (7 rows) So indexes and Materialize are very common in merge joins.\n3.1 Hash Join\nHash join consists of build and probe phases.\nThe build phase places the driving table (inner in the diagram, second row in the plan!) 
into work_mem; the probe phase compares hash values.\nHash join only possible with \u0026lsquo;=\u0026rsquo; conditions Hash join consumes memory; generally both tables aren\u0026rsquo;t very large Note: the driving table (hash build table) is the second row in the plan, opposite of NL 3.2 Hybrid Hash Join with Skew\nNot fully understood; appears to support spilling to disk. To be revisited.\nhttps://www.interdb.jp/pg/pgsql03/05/01.html\n16. Applicable Scenarios for Various Index Types (HASH/GIN/BTREE/GIST/BLOOM/BRIN) # (1) BTREE\nhttps://en.wikibooks.org/wiki/PostgreSQL/Index_Btree\nPossible usage patterns:\n\u0026lt; \u0026lt;= = \u0026gt;= \u0026gt;\tIS NULL IS NOT NULL LIKE \u0026#39;foo%\u0026#39; A meta node points to the root node Leaf node access complexity O(logN), N being row count Inherently sorted, easily used by ORDER BY, MIN/MAX, GROUP BY, merge joins, etc. Default index type, most common. Structure is similar across databases with leaf node structure differences (MySQL secondary index leaf nodes store index key + primary key, then access clustered index via primary key; Oracle index leaf nodes store index key + rowid; PG index leaf nodes store index key + tid) (2) HASH\n（https://leopard.in.ua/2015/04/13/postgresql-indexes）\nIndex data is converted to 32-bit hash values stored in corresponding hash buckets; different hash values point to their respective data rows.\nComplexity O(1) Hash indexes can only be used for = conditions When key values are large, they\u0026rsquo;re generally smaller than BTREE indexes and don\u0026rsquo;t need character-by-character comparison like BTREE, offering better efficiency. So hash indexes suit scenarios with large key values (3) GIST\nGIST (Generalized Search Tree) is similar to BTREE, also a balanced tree. GIST isn\u0026rsquo;t actually one index type but a framework containing many index strategies: R-TREE, RD-TREE. Unlike BTREE using =, \u0026gt; etc. 
for numeric/character data, GIST excels at geographic, text, image, and similar data. Geographic operators include: \u0026lt;-\u0026gt; distance calculation, \u0026lt;\u0026lt; left-of check, @\u0026gt; contains check, etc.\nGIST excels at:\nGIS data processing (similar data processing also possible, e.g., digoal-GIST index for IP range query optimization) Nearest-neighbor algorithms (pg_vector and similar vector data; to be researched) Full-text search (seems to need contrib/intarray) RTREE:\n（https://en.wikipedia.org/wiki/R-tree）\nThe most common index for GIS data is RTREE. Two-dimensional spatial data consists of coordinates; scanning coordinates one by one to find locations is slow. BTREE isn\u0026rsquo;t suitable for such data, so RTREE emerged. RTREE\u0026rsquo;s core concept is grouping nearby points using rectangles at different hierarchy levels; finer grouping yields more precise positioning.\nhttps://postgrespro.com/blog/pgsql/4175817\n(4) SP-GIST:\nSpace-Partitioned GIST is similar to GIST, also an index creation framework. SP-GIST suits structures that partition space into non-overlapping regions (unlike RTREE which overlaps), such as quadtrees, k-d trees, and radix trees.\nQuadtrees:\n（https://en.wikipedia.org/wiki/Quadtree）\nQ-TREE comes in square, rectangular, and various shapes. The most \u0026ldquo;orthodox\u0026rdquo; Q-TREE as shown above generally has these properties:\nEach internal node has four children Index follows depth structure to locate data K-d trees:\n（https://en.wikipedia.org/wiki/K-d_tree）\nK-dimensional trees manage multi-dimensional points using multi-dimensional space concepts; each non-leaf node is split in two. For example, the 3D space diagram above is a 3-dimensional k-d tree model: first split (red) divides the entire space in half; second split (green) divides subspaces in half\u0026hellip; until no further division is possible. 
The second diagram shows the tree structure of a 3D k-d tree (don\u0026rsquo;t mistake it for BTREE!); this tree has only 3 dimensions: Name, Age, Salary.\nRadix-tree:\n（https://en.wikipedia.org/wiki/Radix_tree）\nRadix: each child synthesizes its parent. Key lookup complexity is O(path length); if common prefixes exist, complexity is higher.\nhttps://postgrespro.com/blog/pgsql/4220639\n(5) GIN\nBTREE and GIST have very low query efficiency when there are very many key-value entries. GIN (Generalized Inverted Index) excels at such scenarios: array, full text, and JSON retrieval operations. Both GIST and GIN are generalized/framework-based, supporting multiple data index types; both also support full-text indexing. GIN only supports Bitmap scans.\nPostgreSQL natively supports many operators, some of which are GIN-related data type operators:\nArray operators, e.g., @\u0026gt; whether array1 contains array2; unnest expand array Full-text search operators, e.g., @@ whether tsvector matches tsquery Also some JSON operators PG supports two data types for full-text search: tsvector and tsquery\n1. tsvector:\ntsvector tokenizes text with deduplication and sorting, using tsvector_ops operators. Example tokenization:\nSELECT \u0026#39;The Fat Rat is a Rat\u0026#39;::tsvector; tsvector ---------------------------- \u0026#39;Fat\u0026#39; \u0026#39;Rat\u0026#39; \u0026#39;The\u0026#39; \u0026#39;a\u0026#39; \u0026#39;is\u0026#39; ::tsvector tokenization is generally not the final form; to_tsvector normalizes tokens (final form), showing token positions:\nSELECT to_tsvector(\u0026#39;english\u0026#39;, \u0026#39;The Fat Rat is a Rat\u0026#39;); to_tsvector ------------------- \u0026#39;fat\u0026#39;:2 \u0026#39;rat\u0026#39;:3,6 Note \u0026rsquo;the\u0026rsquo;, \u0026lsquo;is\u0026rsquo;, \u0026lsquo;a\u0026rsquo;, and case are all removed — this is to_tsvector\u0026rsquo;s rule, matching real-world scenarios since full-text search typically targets words.\n2. 
tsquery:\nNormally you can search tokenized text by word:\nSELECT to_tsvector(\u0026#39;The Fat Rat is a Rat\u0026#39;) @@ \u0026#39;rat\u0026#39;; ?column? ---------- t To search for \u0026ldquo;contains both fat and rat\u0026rdquo;, simple word input won\u0026rsquo;t work — tsquery operates on the tokens being searched.\ntsquery can be composed with \u0026amp; (AND), | (OR), ! (NOT), \u0026lt;-\u0026gt; (FOLLOWED BY). Examples:\nSELECT to_tsvector(\u0026#39;The Fat Rat is a Rat\u0026#39;) @@ to_tsquery( \u0026#39;fat\u0026amp;rat\u0026#39; ); ?column? ---------- t SELECT to_tsvector(\u0026#39;The Fat Rat is a Rat\u0026#39;) @@ to_tsquery( \u0026#39;fat\u0026amp;rat\u0026amp;cat\u0026#39;); ?column? ---------- f SELECT to_tsvector(\u0026#39;The Fat Rat is a Rat\u0026#39;) @@ to_tsquery( \u0026#39;rat\u0026lt;-\u0026gt;fat\u0026#39;); ?column? ---------- f Fulltext GIN:\nFull-text GIN indexes first tokenize the indexed field (to_tsvector). Example: doc_tsv below is the tokenized state of left:\nctid | left | doc_tsv -------+----------------------+--------------------------------------------------------- (0,1) | Can a sheet slitter | \u0026#39;sheet\u0026#39;:3,6 \u0026#39;slit\u0026#39;:5 \u0026#39;slitter\u0026#39;:4 (0,2) | How many sheets coul | \u0026#39;could\u0026#39;:4 \u0026#39;mani\u0026#39;:2 \u0026#39;sheet\u0026#39;:3,6 \u0026#39;slit\u0026#39;:8 \u0026#39;slitter\u0026#39;:7 (0,3) | I slit a sheet, a sh | \u0026#39;sheet\u0026#39;:4,6 \u0026#39;slit\u0026#39;:2,8 (1,1) | Upon a slitted sheet | \u0026#39;sheet\u0026#39;:4 \u0026#39;sit\u0026#39;:6 \u0026#39;slit\u0026#39;:3 \u0026#39;upon\u0026#39;:1 (1,2) | Whoever slit the she | \u0026#39;good\u0026#39;:7 \u0026#39;sheet\u0026#39;:4,8 \u0026#39;slit\u0026#39;:2 \u0026#39;slitter\u0026#39;:9 \u0026#39;whoever\u0026#39;:1 (1,3) | I am a sheet slitter | \u0026#39;sheet\u0026#39;:4 \u0026#39;slitter\u0026#39;:5 (2,1) | I slit sheets. 
| \u0026#39;sheet\u0026#39;:3 \u0026#39;slit\u0026#39;:2 (2,2) | I am the sleekest sh | \u0026#39;ever\u0026#39;:8 \u0026#39;sheet\u0026#39;:5,10 \u0026#39;sleekest\u0026#39;:4 \u0026#39;slit\u0026#39;:9 \u0026#39;slitter\u0026#39;:6 (2,3) | She slits the sheet | \u0026#39;sheet\u0026#39;:4 \u0026#39;sit\u0026#39;:6 \u0026#39;slit\u0026#39;:2 Then indexing by tokens and their ctids:\n(https://postgrespro.com/blog/pgsql/4261647)\nThe index is sorted by token order, similar to BTREE; leaf nodes store ctids pointed to by tokens. Since the same token can come from multiple tuples, a token can point to multiple ctids. When multiple ctids exist, a posting tree is built — essentially a BTREE of ctids within.\nFulltext GIN addressing:\nfor \u0026ldquo;mani\u0026rdquo; — (0,2). for \u0026ldquo;slitter\u0026rdquo; — (0,1), (0,2), (1,2), (1,3), (2,2).\nGIN updates:\nUpdating (insert/update/delete) a text generally requires updating many places in the GIN index because:\nOne text can have many tokens scattered across GIN index branches One token may contain multiple ctids since many texts share that token This makes GIN updates very expensive. Batch updates are typically better than row-by-row updates since some tokens are shared, reducing update work.\nBesides batch updates, GIN provides fast update functionality (fastupdate = true):\n（https://www.pgcon.org/2016/schedule/attachments/434_Index-internals-PGCon2016.pdf）\nGIN fast update:\nIncrementally updated data goes to a separate, unsorted area When vacuum runs or the list reaches gin_pending_list_limit, incremental updates are written back to the main GIN index GiST or GIN?\nBoth GiST and GIN are generalized index frameworks supporting full-text indexing, but their full-text index structures are completely different. 
GIST suits geographic and multi-dimensional spatial data; GIN mainly indexes scenarios where a key contains multiple values, such as arrays, full text, JSON.\nGIN indexes are faster than GiST; generally, full-text indexing can blindly choose GIN (reference: GIST vs GIN) Only with very frequent updates should GiST be considered, assuming fast update strategy can\u0026rsquo;t solve the update problem (e.g., configuring nightly write-back strategy). Better to compare GiST and GIN for various full-text indexing scenarios. https://www.postgresql.org/docs/16/datatype-textsearch.html\nhttps://postgrespro.com/blog/pgsql/4261647\n(6) BRIN\n（https://postgrespro.com/blog/pgsql/5967830）\nBRIN is not a tree-type index. Data is grouped in multiple pages (or blocks) as one range (similar to range partition but not physically partitioned). The table is divided into ranges, hence the name Block Range Index (BRIN).\nThe most critical BRIN component is the revmap layer, which stores only key value ranges and ctids, not the key values themselves. This is why BRIN indexes are very small — storing key values would make it like a branch-less BTREE.\nSince only key value ranges and ctids are stored, data lookup requires accessing all data pages pointed to by matching revmap pages, then rechecking for final data rows.\nQUERY PLAN ---------------------------------------------------------------------------------- Bitmap Heap Scan on flights_bi (actual time=75.151..192.210 rows=587353 loops=1) Recheck Cond: (airport_utc_offset = \u0026#39;08:00:00\u0026#39;::interval) Rows Removed by Index Recheck: 191318 Heap Blocks: lossy=13380 -\u0026gt; Bitmap Index Scan on flights_bi_airport_utc_offset_idx (actual time=74.999..74.999 rows=133800 loops=1) Index Cond: (airport_utc_offset = \u0026#39;08:00:00\u0026#39;::interval) Whether index key order matches storage order is critical. 
For example, non-sequentially stored extra key value data may be on \u0026ldquo;distant\u0026rdquo; pages, requiring extra IO to access distant data pages. Worst case, it may scan the entire table:\n（https://www.pgcon.org/2016/schedule/attachments/434_Index-internals-PGCon2016.pdf）\nBRIN suitable scenarios:\nBRIN indexes only suit data where index key order is highly consistent with storage order. Check the column\u0026rsquo;s correlation in pg_stats — should approach 1 (maybe -1 also works?), typically auto-increment primary keys and timestamp columns Nearly no update scenarios. Updates may reduce correlation BRIN indexes generally suit very large data, especially TB-scale and beyond https://postgrespro.com/blog/pgsql/5967830\n(7) RUM\nRUM is an extension, not natively included in PG. RUM and GIN indexes are similar except RUM additionally stores tsvector position information.\nAlthough GIN requires to_tsvector() (or direct tsvector) for tokenization, GIN doesn\u0026rsquo;t use the position information from to_tsvector(). For example, finding the distance between two tokens can\u0026rsquo;t be done with GIN — only via raw to_tsvector() data. RUM handles this.\nRUM indexes attach token position information alongside ctids, compared to GIN:\n（https://postgrespro.com/blog/pgsql/4262305）\nRUM, similar to GIN, suits full-text indexing, with additional capabilities:\nDistance operators (e.g., \u0026lt;=\u0026gt;) for distance calculation Position-based sorting https://postgrespro.com/blog/pgsql/4262305\n(8) BLOOM\nA Bloom filter quickly determines whether an element is in a set. Bloom filters can have false positives — \u0026ldquo;in set\u0026rdquo; isn\u0026rsquo;t guaranteed true, but \u0026ldquo;not in set\u0026rdquo; is guaranteed true. BLOOM indexes are also non-tree, flat structures (requiring recheck like BRIN).\n（https://en.wikipedia.org/wiki/Bloom_filter）\nBloom indexes can index many columns. 
Similar to hash indexes, but unlike hash indexes, they can specify hashed fields and combine them, with total length limited by the length parameter. Because of the segmented hashing and truncation, false positives exist. Shorter length means higher false positive probability (max length 4096 bits).\ncreate index on ... using bloom(...) with (length=..., col1=..., col2=..., ...); （https://postgrespro.com/blog/pgsql/5967832）\nhttps://www.postgresql.org/docs/current/bloom.html\nhttps://postgrespro.com/blog/pgsql/5967832\nSummary:\nIndex Type Structure Operators Access Complexity Native? Ordered? Accurate? Applicable Scenarios Advantages Disadvantages btree btree; branch stores key ranges, leaf nodes store keys and ctids, generally ascending \u0026gt;=, =, IS NULL etc. common operators; leftmost prefix rule O(logN) Yes Yes Yes High selectivity scenarios; not suitable for too-large data Fits most scenarios; no extra sorting needed Large key values make index very large; index fragmentation/splitting (HOT mitigates) hash Builds hash buckets; different hash values point to different rows Only = O(1) Yes No Yes Only = condition scenarios; large key values Generally small; fast access Very narrow use case GiST Index framework; R-TREE, RD-TREE; groups addresses at different layers for precision Spatial operators: \u0026lt;-\u0026gt; distance, \u0026lt;\u0026lt; left-of, @\u0026gt; contains etc. 
Layer height Yes Yes (supports KNN) Yes GIS; KNN; frequently updated full-text index GIS, multi-dimensional data Special-case scenarios sp-GiST/Q-tree (sp-GiST is framework; index excludes overlapping data) Q-tree: each node has 4 internal nodes Spatial operators: up/down/left/right, equality, contains Layer height Yes Yes Yes GIS GIS GIS sp-GiST/k-d tree k-d tree: splits multi-dimensional space at nodes until no further split Spatial operators Min O(k), avg O(logN), max O(N/2) Yes Yes Yes GIS; multi-dimensional data GIS, multi-dimensional data Special-case scenarios sp-GiST/radix-tree radix-tree: each child synthesizes its parent Common operators: =, \u0026gt;, ~ etc. Min O(1), max O(N) Yes Yes Yes Scenarios without common data Supports common operators beyond GIST Limited scenarios; can be very slow GIN Index framework; similar to btree: branch stores token ranges, leaf stores tokens and ctids; one token pointing to multiple ctids may have sub-posting-tree; fast update enabled adds linked-list space for incremental data Operators vary slightly by data type; generally @@ contains Related to text length/token repetition; approx O(logN) Yes No (branches ordered but no token position info) Yes Key-contains-multiple-values scenarios: array, full text, JSON, many columns Best choice for multi-value key scenarios Updates need proper strategy BRIN Non-tree: groups data pages by range; rev index layer stores only key ranges and ctids Common operators: \u0026lt; \u0026lt;= = \u0026gt;= \u0026gt; Page lookup O(1); data return O(N), N=recheck rows Yes Not strictly ordered, only suits ordered data No Sequential storage (time-series, auto-increment); very large tables; nearly no updates; range queries Very small index Extremely demanding on correlation RUM Similar to GIN, but additionally stores token position info Includes GIN operators plus position operators Related to text length/token repetition; approx O(logN) No Yes (supports KNN lookup) Yes Key-contains-multiple-values 
scenarios; suitable for KNN Stores position info beyond GIN Requires extension installation BLOOM Each field hashed and truncated; non-tree, bitmap filtering Only = Miss: O(1); hit: O(N), N=recheck rows Yes No No Suitable for miss scenarios Can be very fast Can be very slow on recheck Additional index section references:\nTypes of PostgreSQL Indexes. Short and clear\nhttps://leopard.in.ua/2015/04/13/postgresql-indexes\nhttps://pic.huodongjia.com/ganhuodocs/2017-07-15/1500104265.79.pdf\nhttps://developer.aliyun.com/article/698090?spm=a2c6h.12873639.article-detail.43.702e7149IBMYL9\nhttps://postgresql.us/events/pgopen2019/sessions/session/647/slides/45/look-it-up.pdf\nhttps://www.pgcon.org/2016/schedule/attachments/434_Index-internals-PGCon2016.pdf\n17. How Row Locks Are Implemented, Whether Stored in Shared Memory # Row locks in PG are recorded in the row header, not implemented in memory.\n(1) After t1 updates without committing, it acquires exclusive locks on relation and transactionid:\n(2) t2 updating the same row gets blocked; this blocking is implemented via transactionid sharelock. t2 acquires both relation and tuple locks:\n(3) t3 updating this row gets blocked via tuple exclusive lock:\nIn summary, PG row locks are implemented jointly via transactionid locks, relation locks, and tuple locks:\n《postgresql-internals-14》\nhttps://postgrespro.com/blog/pgsql/5968005\n18. Differences Between Streaming Replication and Logical Replication, and Their Applicable Scenarios # Streaming replication here generally refers to PG physical replication, synchronizing full WAL logs downstream for replay by the downstream PG instance at the physical block level:\nLogical replication requires logically decoding transaction information from WAL for relevant tables, ordering transactions via reorder buffer, then outputting data in the form determined by the output plugin. The downstream need not be a PG instance. 
Must have replication slots managing logical decoding, output plugin, reorder buffer, replication positions, etc., plus knowledge of replica identity, slot/sender status, and more:\nLogical replication has many issues but is increasingly widely used and is a key focus area for PG community updates.\nFor example (incomplete list):\nlogical_decoding_work_mem is no longer hardcoded 4096 (changes); it\u0026rsquo;s now a configurable GUC parameter. Decoding spill issues are somewhat mitigated PG14+ supports streaming logical replication: uncommitted transactions can transmit data downstream; subsequent commit info determines whether to apply the changes Standby servers support replication slots; logical replication can be established on standbys Failover slots (in progress?) Many more updates\u0026hellip; PG流复制详解\npg内功修炼：逻辑复制\n19. What Is Streaming Replication Conflict and Why It Occurs # Cause of conflict:\nThe standby is running a query on a table (from application or manual connection). Meanwhile, the primary executes DROP TABLE, written to WAL and transmitted to the standby for replay. To ensure data consistency, PostgreSQL must rapidly replay WAL. The DROP TABLE and SELECT then conflict. 
Since the primary doesn\u0026rsquo;t know the standby\u0026rsquo;s transaction state, and the standby must stay consistent with the primary, \u0026ldquo;query conflict\u0026rdquo; occurs.\nConflict scenarios:\nPrimary exclusive locks (including explicit LOCK commands and various DDL) Primary vacuum cleaning dead tuples — if the standby is using those tuples, conflict arises Primary drops a tablespace that the standby query is using Primary drops a database that the standby is using Mitigating query conflicts (can\u0026rsquo;t fully resolve):\nhot_standby_feedback: standby periodically notifies the primary of the minimum active transaction ID (xmin), preventing the primary vacuum from cleaning tuples older than the xmin value.\nmax_standby_streaming_delay: standby queries aren\u0026rsquo;t immediately canceled; instead wait for a period before throwing an error if not finished.\nmax_standby_archive_delay: waiting time before canceling standby queries due to conflicts from processing archived WAL logs.\nvacuum_defer_cleanup_age: specifies how many transactions vacuum delays dead tuple cleanup by; i.e., vacuum and vacuum full won\u0026rsquo;t immediately clean just-deleted tuples.\nPG流复制详解\n20. PostgreSQL Permission System Overview # Hard to summarize comprehensively; it\u0026rsquo;s somewhat complex. Key points:\nPermission access requires each layer to be \u0026ldquo;open\u0026rdquo;; none can be missing Best to separate read-only/read-write/owner users Read-only and read-write permissions can be managed via roles PostgreSQL学徒:又被权限搞晕了？拿捏！\n21. 
Common High Availability Solutions, Selection Criteria, Pros and Cons # HA selection considerations:\nSync mode choice, availability zones, cross-region multi-active Switchover, failover Load balancing, read/write separation Host, database, and application-level HA VIP switching, connection string HA, connection switching Solving single point of failure or split-brain; election mechanisms Below are some known architectures:\npgpool-II+watchdog:\n（https://www.pgpool.net/docs/latest/en/html/example-cluster.html）\nPros: automatic failover, read/write separation, load balancing, watchdog election Cons: complex configuration, pgpool doesn\u0026rsquo;t fully support all PG features, pgpool performance overhead, depends on watchdog election\npatroni+etcd:\nPros: GUI (patroni), automatic failover, majority election Cons: learning curve, doesn\u0026rsquo;t support other databases (patroni)\npatroni+pgbouncer+haproxy+etcd:\n（https://www.percona.com/sites/default/files/eBook-PostgreSQL-High-Availability.pdf）\nPros: open-source stack: haproxy for load balancing, pgbouncer for connection pooling, patroni for cluster management, etcd for election Cons: very complex configuration\nPing An Financial Cloud rasesql architecture:\n（https://www.ocftcloud.com/ssr/help/database/RASESQL/intro.Architecture）\nPros: failover support, simple architecture Cons: same-city remote can\u0026rsquo;t directly read-only access, higher resource usage, no election (?)\nAlibaba Cloud Polar-X:\n（PolarDB for PostgreSQL 三节点功能介绍）\nPros: read/write separation, can add non-voting nodes, failover, logger nodes participate in election/data flow/backup Cons: \u0026hellip;\nGoogle Cloud PG:\nThree architecture options:\nGoogle Cloud Native Architecture (MIG):\nPros: three options to choose from, well-documented! 
(the other two derive from open-source architectures with similar pros/cons; MIG cloud-native approach described below) MIG advantages: doesn\u0026rsquo;t depend on PG native HA; uses Regional persistent disk for data HA. Primary zone network isolation; disk can be attached to zone B in the same region (within 1 minute). MIG disadvantages: no read replicas; only within-region failover (no multi-region deployment)\nAurora for PG:\nPros: simple architecture, recovered primary node auto-joins cluster, multi-region deployment, standby readable Cons: (seemingly) no election mechanism; docs heavy on text, light on diagrams\n崔健：PostgreSQL的高可以架构设计与实践\nhttps://www.pgpool.net/docs/latest/en/html/example-cluster.html\n汪总： Postgresql 高可用\n使用Patroni和HAProxy创建高度可用的PostgreSQL集群\nhttps://www.percona.com/sites/default/files/eBook-PostgreSQL-High-Availability.pdf\nPolarDB for PostgreSQL 三节点功能介绍\nhttps://cloud.google.com/architecture/architectures-high-availability-postgresql-clusters-compute-engine\nhttps://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.Overview.html\n22. Five Levels of synchronous_commit; Why Standby Queries Can\u0026rsquo;t Immediately See Primary Inserts # PG流复制详解\n23. Transaction ID Wraparound Causes and Maintenance Optimization # Why transaction ID wraparound exists:\nEvery non-query transaction consumes a transaction ID. Query transactions consume virtual transaction IDs (VXID), which are locally counted. Though VXID has wraparound issues, session restart resets VXID counting, so it\u0026rsquo;s rarely problematic.\nHowever, transaction IDs have an upper limit. TransactionId is a 32-bit unsigned integer, storing 2^32=4294967296 — about 4.2 billion transactions. At this point, transaction IDs must wrap around to the initial state, which is why transaction IDs form a ring.\nDue to visibility rules, the 4.2 billion transactions must be split in half: one half represents the future, the other the past. 
The difference between max and min transactions in a PG instance cannot exceed 2.1 billion — hence the 2.1 billion transaction limit.\n（https://www.interdb.jp/pg/pgsql05/01.html）\nTransaction ID freezing:\nDue to visibility rules, if a visible row (e.g., xid=100) differs from the latest transaction by more than 2.1 billion, it becomes invisible:\n（Forgot the source; look it up）\nTo solve this, the transaction ID freezing mechanism was introduced. Freezing sets the xmin of overly old tuples to FrozenXID=2, older than all normal transactions. That is, txid=2 is visible to all normal transactions (txid\u0026gt;=3). In version 9.4+, t_infomask\u0026rsquo;s xmin_frozen flag indicates frozen tuples rather than rewriting t_xmin to 2.\nLazy mode: The VM file was originally designed to reduce vacuum overhead by letting vacuum skip pages with no dead tuples (all-visible). Later (pg9.4), the freeze process was enhanced so lazy mode freezing can also skip all-visible pages during vacuum.\nLazy mode freeze trigger: triggered alongside vacuum operation (seems to have no independent trigger condition???)\nLazy mode freeze which tuples: except pages marked all-visible in VM that get skipped, freezes tuples whose xmin-to-active-transaction-ID (actually oldestxmin) gap exceeds vacuum_freeze_min_age (default 50M), marking them xmin_frozen. In the diagram below, tuple 9\u0026rsquo;s xmin=3000 won\u0026rsquo;t be frozen.\nLazy mode is more of a vacuum side-effect: since we\u0026rsquo;re already concurrently vacuum scanning and cleaning dead tuples with pages already scanned, we might as well freeze eligible tuples.\nEager mode: Lazy mode has a problem: it works alongside vacuum, skipping pages with no dead tuples (all-visible). If a page contains only live tuples (all-visible but not all-frozen) with very old xmin values, lazy mode alone can\u0026rsquo;t freeze them. So eager mode is needed: skip pages already marked all-frozen in VM and freeze the rest. 
In real scenarios, eager mode is typically the one running periodically and requiring attention: even if only one page in a table has tuples that are all inserts (even just one static page), eager mode is needed.\nEager mode freeze triggers:\nVacuum_freeze_table_age for vacuum operations: when the database-level minimum xmin (actually pg_database.datfrozenxid, also the minimum of all pg_class.relfrozenxid in that database) and the active transaction ID (actually oldestxmin) gap exceeds Vacuum_freeze_table_age (default 150M), vacuum triggers eager mode freezing.\nautovacuum_freeze_max_age for autovacuum: whether lazy mode or eager mode Vacuum_freeze_table_age, vacuum must first be triggered. Relying solely on vacuum\u0026rsquo;s own trigger conditions for freezing is unreliable; a freeze-specific deadline parameter is needed: autovacuum_freeze_max_age. When tuple age exceeds autovacuum_freeze_max_age (200M), autovacuum is force-triggered for freezing. Even if autovacuum is disabled, this deadline-triggered freeze still works.\nEager mode freeze which tuples: similar to lazy mode, except for all-frozen pages (lazy uses all-visible — different), freezes tuples whose xmin-to-active-transaction-ID gap exceeds vacuum_freeze_min_age (default 50M). In the diagram, tuple 11 is not frozen.\nvacuum freeze command: VACUUM FREEZE is equivalent to setting vacuum_freeze_min_age and vacuum_freeze_table_age to 0, performing eager mode freezing for all inactive xmin tuples.\nvacuum_failsafe_age: Since large table vacuum operations are very slow, freeze may not finish before transaction ID wraparound occurs. Because freeze is done by the vacuum process, and vacuum has many other operations and parameter settings, to accelerate freeze, cost-based vacuuming, buffer strategy, and index vacuuming are all ignored. 
Parameter default is 1.6B; in practice, during vacuum the effective value is never lower than autovacuum_freeze_max_age * 105%.\nCLOG may also be updated: Additionally, if freezing advances pg_database.datfrozenxid, CLOG that is no longer needed is also cleaned up. CLOG records transaction status, which is used to determine visibility for \u0026ldquo;relatively new\u0026rdquo; transactions and tuples. If a database\u0026rsquo;s datfrozenxid has been advanced, the \u0026ldquo;old\u0026rdquo; tuples behind it are already marked frozen (always visible), so the status of those \u0026ldquo;old\u0026rdquo; transactions in CLOG can be discarded.\nMaintenance optimization: (summarized from Can Zong\u0026rsquo;s summary)\nMonitor age(pg_database.datfrozenxid) in production. When it approaches the trigger values, proactively run VACUUM FREEZE during low-traffic windows rather than waiting for the database to trigger it passively. Partition tables; overly large tables cause long prevent-wraparound operations. Set different vacuum ages for large tables: ALTER TABLE test SET (autovacuum_freeze_max_age=xxxx); User-scheduled freeze: during low-traffic windows, VACUUM FREEZE large, aged tables. Watch for freeze-blocking scenarios: long transactions, replication slots, hot_standby_feedback, pg_dump, cursors, orphan transactions. Set sufficient worker processes so that vacuum jobs do not queue. If load is a concern, consider enabling cost-based vacuuming (vacuum_cost_delay etc.). autovacuum_freeze_max_age should exceed vacuum_freeze_table_age to leave room for manual vacuum. Official recommendation: keep vacuum_freeze_table_age at or below 0.95 * autovacuum_freeze_max_age; if vacuum_freeze_table_age is set above 0.95 * autovacuum_freeze_max_age, vacuum silently caps the effective value at 0.95 * autovacuum_freeze_max_age. vacuum_failsafe_age: on PG14+, set a reasonable vacuum_failsafe_age to accelerate large table freeze and prevent wraparound; it should exceed autovacuum_freeze_max_age * 105%.
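The wraparound arithmetic and the two parameter caps described above can be sketched in a few lines of Python. This is a minimal illustrative model, not PostgreSQL code: the function names are hypothetical, and special xids such as FrozenXID=2 are ignored.

```python
# Illustrative model only; the function names are hypothetical, not PostgreSQL APIs.
XID_SPACE = 2 ** 32   # TransactionId is a 32-bit unsigned integer
HALF_SPACE = 2 ** 31  # about 2.1 billion: the size of the 'past' half of the ring


def xid_precedes(a, b):
    # True if normal xid a is older than xid b on the circular xid ring.
    # A modular difference that lands in the upper half means a is behind b.
    return (a - b) % XID_SPACE >= HALF_SPACE


def effective_freeze_table_age(vacuum_freeze_table_age, autovacuum_freeze_max_age):
    # The effective value is capped at 95% of autovacuum_freeze_max_age.
    return min(vacuum_freeze_table_age, autovacuum_freeze_max_age * 95 // 100)


def effective_failsafe_age(vacuum_failsafe_age, autovacuum_freeze_max_age):
    # The effective value is never lower than 105% of autovacuum_freeze_max_age.
    return max(vacuum_failsafe_age, autovacuum_freeze_max_age * 105 // 100)


if __name__ == '__main__':
    # An xid just before the wrap point is still older than a small xid after it.
    print(xid_precedes(4294967000, 100))
    # Defaults from the text: 150M and 200M, so the cap does not apply.
    print(effective_freeze_table_age(150_000_000, 200_000_000))
    # Default failsafe age of 1.6B is far above the 105% floor.
    print(effective_failsafe_age(1_600_000_000, 200_000_000))
```

The modular comparison is why the usable distance between the oldest and newest xid is half of the 32-bit space: once two xids are more than about 2.1 billion apart, the comparison flips and an old row would look like it came from the future.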
https://www.interdb.jp/pg/\nhttps://www.postgresql.org/docs/16/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND\n深入理解PostgreSQL冻结炸弹\npg事务：事务ID\n24. Vacuum / Autovacuum Functions and Tuning # Functions:\nClean up \u0026ldquo;dead tuples\u0026rdquo; left by UPDATE or DELETE operations. Track available space in table blocks and update the free space map. Update the visibility map needed for index-only scans. \u0026ldquo;Freeze\u0026rdquo; rows in tables to prevent transaction ID wraparound. Periodically ANALYZE to update statistics. Tuning:\nSet sufficient worker processes to avoid vacuum queuing. Increase maintenance_work_mem (or autovacuum_work_mem). Watch for vacuum-blocking scenarios: long transactions, replication slots, hot_standby_feedback, pg_dump, cursors, orphan transactions. For special tables (business-sensitive, large), set separate autovacuum trigger thresholds (threshold, scale_factor; insert threshold, insert scale_factor): dead tuple cleanup threshold, stats update threshold, wraparound prevention threshold. For special tables, disable per-table autovacuum and run vacuum during off-peak hours for dead tuple cleanup, statistics, and wraparound. If business load is a concern, enable cost-based vacuuming, which sleeps whenever the accumulated cost reaches the limit. Partition tables to avoid vacuum running endlessly or restarting immediately after finishing. Avoid VACUUM FULL (it takes an ACCESS EXCLUSIVE lock, the strongest of the 8 table lock levels, blocking both reads and writes). Use logical replication + rename or pg_repack for table/index bloat handling, improving efficiency and reclaiming space 25.
Function Volatility Categories and Why Functions Need EXECUTE # VOLATILE (unstable, default):\nCan do anything, including modifying the database. Even within the same statement and with identical parameters, may return different results. Obtains a fresh snapshot for each query executed inside the function, so even identical queries within the same function may produce different results as the visible data changes. Since the function must be re-evaluated on every call, the optimizer can\u0026rsquo;t pre-evaluate it; performance may be poor. Function indexes not supported. Typical functions: timeofday(), random(), all modifying functions. STABLE:\nCannot modify the database. Within a single statement, identical parameters return identical results. The snapshot is the one taken at the start of the calling query; queries inside the function don\u0026rsquo;t obtain a new one, so identical queries within the function produce consistent results. Function indexes not supported. Typical functions: current_timestamp family; regardless of how many times called within a transaction, only one value. IMMUTABLE (very stable):\nCannot modify the database. Given identical parameters, always returns identical results. Snapshot acquisition principle same as STABLE. Key difference from STABLE: the planner may pre-evaluate an IMMUTABLE call with constant arguments once at planning time and embed the result in the plan, while a STABLE call is only evaluated at execution time. Function indexes supported. Some database-parameter-dependent functions shouldn\u0026rsquo;t be marked IMMUTABLE, e.g., timezone-related functions should be STABLE. Typical function: calculating 1+2. Why functions need EXECUTE:\nPREPARE: parsed, analyzed, and rewritten\nEXECUTE: planned and executed\nForcing SQL hard parsing: prevents SQL from using incorrect execution plans due to data skew.\nUnlike plain SQL, plpgsql uses plan caching by default: it handles embedded SQL like prepared statements and tries to generate and cache generic plans for soft parsing. However, with data skew, cached execution plans may be inefficient and unacceptable for core business.
In such cases, consider using EXECUTE statements to force per-variable-value execution plans, improving accuracy.\nhttps://blog.csdn.net/Hehuyi_In/article/details/128885660\nhttps://www.postgresql.org/docs/16/xfunc-volatility.html\n26. Why Use CREATE INDEX CONCURRENTLY and Its Hazards # Why CIC:\nCREATE INDEX requires a ShareLock, which conflicts with DML\u0026rsquo;s RowExclusiveLock. So online business shouldn\u0026rsquo;t directly use CREATE INDEX. CIC uses ShareUpdateExclusiveLock, which doesn\u0026rsquo;t conflict with DML locks, so CIC is recommended for index creation.\nCIC process:\nInsert index metadata into system catalogs (pg_class, pg_index), then open two transactions for two scans Open transaction 1, get snapshot1 Before scanning table, wait for all transactions that modified the table (insert/delete/update) to finish Scan table and build index End transaction 1 Open transaction 2, get snapshot2 Before second scan, wait for all transactions that modified the table to finish DML on the table from transactions started after snapshot2 will update this index Second table scan, update index (version numbers from tuples allow identifying records changed between snapshot1 and snapshot2, merging them into the index) After index update, wait for transactions holding snapshots that started before transaction 2 to finish End index creation. Index becomes visible. 
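The lock reasoning behind CIC can be checked against a small subset of PostgreSQL's table-level lock conflict matrix. The Python sketch below models only the three modes mentioned above; the dictionary and the helper name are illustrative, and the other five lock modes are deliberately left out.

```python
# Subset of the table-level lock conflict matrix, limited to three modes.
# Each key maps a held lock mode to the requested modes it conflicts with.
CONFLICTS = {
    # Taken by INSERT / UPDATE / DELETE
    'RowExclusiveLock': {'ShareLock'},
    # Taken by CREATE INDEX CONCURRENTLY (and by VACUUM)
    'ShareUpdateExclusiveLock': {'ShareUpdateExclusiveLock', 'ShareLock'},
    # Taken by plain CREATE INDEX
    'ShareLock': {'RowExclusiveLock', 'ShareUpdateExclusiveLock'},
}


def conflicts(held, requested):
    # True if a session requesting `requested` must wait for `held`.
    return requested in CONFLICTS[held]


if __name__ == '__main__':
    # Plain CREATE INDEX has to wait for running DML, and blocks new DML.
    print(conflicts('RowExclusiveLock', 'ShareLock'))
    # CIC does not conflict with DML.
    print(conflicts('RowExclusiveLock', 'ShareUpdateExclusiveLock'))
    # ShareUpdateExclusiveLock conflicts with itself.
    print(conflicts('ShareUpdateExclusiveLock', 'ShareUpdateExclusiveLock'))
```

The self-conflict in the last line is worth noting: two CIC runs, or a CIC and a VACUUM, cannot proceed on the same table at the same time even though neither blocks DML.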
CIC issues:\nOpens two transactions sequentially, scanning the table one extra time vs CREATE INDEX Must wait for long transactions to finish before scanning can begin CIC-created indexes may become invalid CIC interrupted abnormally leaves an invalid index During CIC unique index creation, inserted/updated data violating unique constraints also causes CIC failure leaving an invalid index Invalid indexes still get updated by DML Partition parent tables don\u0026rsquo;t support CIC index creation; create indexes with CIC on child partitions one by one, then create the index on the parent with ONLY 学徒 深度剖析CIC\n27. HOT Principle # HOT:\nWithout HOT, every tuple update would update indexes. Below, one additional updated tuple adds one index entry, and the old index entry points to the dead tuple. This causes index update, index space, and index vacuum pressure.\nWith HOT, in-page updates only update the tuple, not the index:\nHOT tuples correspond to HEAP_HOT_UPDATED and HEAP_ONLY_TUPLE bits in infomask:\nlzldb=\u0026gt; create table tt(a int); lzldb=\u0026gt; create index idxtt on tt(a); lzldb=\u0026gt; insert into tt values(1); lzldb=\u0026gt; update tt set a=1; -- execute multiple times lzldb=\u0026gt; select * from tt; -- after update, run a visibility check to write remaining clog commit info to tuple header lzldb=\u0026gt; SELECT lp,case lp_flags when 0 then \u0026#39;0:LP_UNUSED\u0026#39; when 1 then \u0026#39;LP_NORMAL\u0026#39; when 2 then \u0026#39;LP_REDIRECT\u0026#39; when 3 then \u0026#39;LP_DEAD\u0026#39; end as lp_flags, t_ctid, raw_flags, combined_flags FROM heap_page_items(get_raw_page(\u0026#39;tt\u0026#39;, 0)), LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2) WHERE t_infomask IS NOT NULL OR t_infomask2 IS NOT NULL; lp | lp_flags | t_ctid | raw_flags | combined_flags ----+-----------+--------+-----------------------------------------------------------------------------------------+---------------- 1 | LP_NORMAL | (0,2) | 
{HEAP_XMIN_COMMITTED,HEAP_XMAX_COMMITTED,HEAP_HOT_UPDATED} | {} 2 | LP_NORMAL | (0,3) | {HEAP_XMIN_COMMITTED,HEAP_XMAX_COMMITTED,HEAP_UPDATED,HEAP_HOT_UPDATED,HEAP_ONLY_TUPLE} | {} 3 | LP_NORMAL | (0,4) | {HEAP_XMIN_COMMITTED,HEAP_XMAX_COMMITTED,HEAP_UPDATED,HEAP_HOT_UPDATED,HEAP_ONLY_TUPLE} | {} 4 | LP_NORMAL | (0,5) | {HEAP_XMIN_COMMITTED,HEAP_XMAX_COMMITTED,HEAP_UPDATED,HEAP_HOT_UPDATED,HEAP_ONLY_TUPLE} | {} 5 | LP_NORMAL | (0,5) | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID,HEAP_UPDATED,HEAP_ONLY_TUPLE} | {} lp(line pointer)=1\u0026rsquo;s tuple points to row 2 via ctid(0,2); row 2 points to row 3\u0026hellip; ultimately to row 5. ctid forms a chain pointing to the final data row. Dead tuples all carry HEAP_HOT_UPDATED, indicating the tuple is an updated row on the HOT chain; the chain tail has HEAP_ONLY_TUPLE, marking the end of the HOT chain.\nWith HOT, vacuum only cleans dead tuples within the page without updating indexes:\nlzldb=\u0026gt; vacuum tt; VACUUM lzldb=\u0026gt; SELECT lp,case lp_flags when 0 then \u0026#39;0:LP_UNUSED\u0026#39; when 1 then \u0026#39;LP_NORMAL\u0026#39; when 2 then \u0026#39;LP_REDIRECT\u0026#39; when 3 then \u0026#39;LP_DEAD\u0026#39; end as lp_flags, t_ctid, raw_flags, combined_flags FROM heap_page_items(get_raw_page(\u0026#39;tt\u0026#39;, 0)), LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2) WHERE t_infomask IS NOT NULL OR t_infomask2 IS NOT NULL; lp | lp_flags | t_ctid | raw_flags | combined_flags ----+-----------+--------+----------------------------------------------------------------------+---------------- 5 | LP_NORMAL | (0,5) | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID,HEAP_UPDATED,HEAP_ONLY_TUPLE} | {} After vacuum, dead tuples are cleaned.\nOn subsequent updates, a new HOT chain begins:\nlzldb=\u0026gt; update tt set a=1; lzldb=\u0026gt; update tt set a=1; lzldb=\u0026gt; select * from tt; lzldb=\u0026gt; SELECT lp,case lp_flags when 0 then \u0026#39;0:LP_UNUSED\u0026#39; when 1 then \u0026#39;LP_NORMAL\u0026#39; 
when 2 then \u0026#39;LP_REDIRECT\u0026#39; when 3 then \u0026#39;LP_DEAD\u0026#39; end as lp_flags, t_ctid, raw_flags, combined_flags FROM heap_page_items(get_raw_page(\u0026#39;tt\u0026#39;, 0)), LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2) WHERE t_infomask IS NOT NULL OR t_infomask2 IS NOT NULL; lp | lp_flags | t_ctid | raw_flags | combined_flags ----+-----------+--------+-----------------------------------------------------------------------------------------+---------------- 2 | LP_NORMAL | (0,3) | {HEAP_XMIN_COMMITTED,HEAP_XMAX_COMMITTED,HEAP_UPDATED,HEAP_HOT_UPDATED,HEAP_ONLY_TUPLE} | {} 3 | LP_NORMAL | (0,3) | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID,HEAP_UPDATED,HEAP_ONLY_TUPLE} | {} 5 | LP_NORMAL | (0,2) | {HEAP_XMIN_COMMITTED,HEAP_XMAX_COMMITTED,HEAP_UPDATED,HEAP_HOT_UPDATED,HEAP_ONLY_TUPLE} | {} Why doesn\u0026rsquo;t the new HOT chain start from lp1? Because lp1 is already occupied — the index still points to lp1.\nlzldb=\u0026gt; SELECT itemoffset, ctid, data, dead, htid, tids[0:2] AS some_tids FROM bt_page_items(\u0026#39;idxtt\u0026#39;,1); itemoffset | ctid | data | dead | htid | some_tids ------------+-------+-------------------------+------+-------+----------- 1 | (0,1) | 01 00 00 00 00 00 00 00 | f | (0,1) | htid (0,1) is page 0, lp 1. Vacuum only cleaned the data page; the index was not updated. Vacuum only cleaned dead tuples and the middle of the HOT chain; HOT chain head and tail ctids were untouched.\nINDEX ONLY SCAN:\nIndex-only scan is a common and efficient scan method across databases: it returns results by accessing only index pages without touching data pages. However, this is problematic in PG because visibility information is stored in data page headers, not index pages. Accessing only the index can\u0026rsquo;t support MVCC in principle.\nThe VM file not only supports vacuum skipping all-visible pages but also supports INDEX ONLY SCAN for visibility determination on all-visible pages:\nReference: interdb\n28. 
Does PostgreSQL Have Lock Escalation? # Basically no.\nOnly predicate locks escalate. Predicate locks are used under serializable isolation; they lock predicates to prevent data anomalies and achieve serializability. In PG, this corresponds to SIReadLock.\nThe finest granularity of a predicate lock is the rows within a range. When the row count exceeds a threshold, the corresponding page is locked. When the page count exceeds a threshold, the corresponding table is locked. Predicate locks have only 3 levels: row, page, table. https://postgrespro.com/blog/pgsql/5968020\n29. Replication Slot Functions and Hazards # For physical replication, replication slots aren\u0026rsquo;t strictly necessary: wal_keep_size (wal_keep_segments before PG13) or WAL archiving can retain WAL for the standby, and hot_standby_feedback protects rows the standby still needs from vacuum. With replication slots, the primary tracks each standby\u0026rsquo;s position and retains the WAL it still needs, so those WAL retention settings become unnecessary.\nFor logical replication, replication slots are mandatory; one logical replication link corresponds to one slot. For logical replication, slots manage not only WAL logs but also logical decoding, output plugin, decoding/sending positions (LSN), allowing retransmission of decoded logs after replication interruption.\nReplication slot hazards:\nActually, replication slots have no inherent hazards. Their primary function is simplifying WAL log management. Without slots, you still need WAL management strategies. The PG community recommends using slots. Just note: always clean up unused slots so they don\u0026rsquo;t hold old positions that block WAL cleanup and fill the disk. Additionally, DBAs shouldn\u0026rsquo;t casually drop slots: once a slot is dropped, its position info is lost, and downstream links may need data reinitialization and resynchronization. It is better to first confirm whether the replication link can resume syncing.\npg内功修炼：逻辑复制\n30. Why Deadlocks Occur and Deadlock Detection Mechanism # Simplest case: transaction T1 holds resource 1, transaction T2 holds resource 2.
If T1 tries to acquire resource 2 and T2 tries to acquire resource 1, a deadlock forms. Without management, deadlocks can wait indefinitely, so all DBMS have deadlock detection. Deadlocks usually indicate business logic issues. If no explicit cancellation of one transaction in the \u0026ldquo;ring\u0026rdquo; breaks it, PG auto-detects deadlocks and force-terminates one transaction via the deadlock_timeout parameter (default 1s); other transactions in the \u0026ldquo;ring\u0026rdquo; can continue.\nhttps://postgrespro.com/blog/pgsql/5968020\n31. SQL Performance Troubleshooting Approaches # 32. Why Use Partitioned Tables, Advantages and Disadvantages # Partitioned tables split table data into smaller physical fragments to improve performance, availability, and manageability, transparent to applications. Partitioned tables are a common optimization for large tables in relational databases. DBMS generally provide partition management, and applications can directly access partitioned tables without architecture changes — though good performance requires proper partition access patterns.\nPG natively supports declarative partitioning and inheritance partitioning. Common plugin-based implementations include pg_pathman. PG10 introduced declarative partitioning with many enhancements in subsequent versions (see PostgreSQL Partitioned Tables — History). PG12+ with declarative partitioning is recommended.\nAdvantages of partitioned tables:\nSQL performance improvement. In some scenarios, e.g., splitting large data into multiple partitions where SQL only queries one partition, partition pruning can dramatically improve performance Partitions work with indexes. Accessing one partition\u0026rsquo;s index is more efficient than accessing an unpartitioned large index Dropping a partition is more efficient than deleting many rows. 
Common in time-range partitioning: dropping an unused historical partition is very fast, while DELETE without partitions is slow and requires extra maintenance Faster vacuum. Vacuuming a large table for old version cleanup or statistics collection can be very slow; SQL problems may arise before vacuum finishes. With partitions, vacuum is much faster IO distribution. Different partitions can be placed on different paths/disks. Rarely used data can go on cheaper disks More maintenance techniques. Directly maintaining a huge table is very difficult (e.g., vacuuming an extremely large table has many issues), while partitioned table partitions can be vacuumed individually. Also, attach/detach, local indexes/constraints etc. can be flexibly used May enable partition-wise join or partition-wise aggregation features Disadvantages of partitioned tables:\nIn PG, partitions are also tables; too many tables cause slow parsing and large relcache metadata caching Too many tables may cause errors. Reference: 较少的分区也报错too many range table entries Even if partition count doesn\u0026rsquo;t error, without partition pruning during plan generation (may happen at execution), EXPLAIN output becomes very large, and logs become bloated with long plans Strange issues: 不同用户查看到不同的执行计划 Major limitations of PG native partitioned tables:\nNo native automatic partition creation Only local indexes supported, no global indexes Primary key must include the partition key. PostgreSQL currently can only enforce uniqueness within individual partitions, hence this limitation. 
Oracle and MySQL don\u0026rsquo;t have this restriction Unique index must include the partition key (same reason as primary key) Cannot create global constraints (child tables inherit but can\u0026rsquo;t create table-level global constraints) Partitioned table maintenance:\nNew partitions without data: directly use PARTITION OF (8-level lock; just watch for long transactions) New partitions with data: use ATTACH (4-level lock, doesn\u0026rsquo;t block reads/writes) to add; if needed, pre-add partition constraints to reduce constraint check time. DETACH CONCURRENTLY (4-level lock) to remove partitions Note: ATTACH doesn\u0026rsquo;t auto-create indexes, constraints, defaults, or row-level triggers like PARTITION OF does; create them beforehand Partition parent table indexes don\u0026rsquo;t support CIC. Correct approach for partition index creation: 1) create ONLY on parent 2) create CONCURRENTLY on partitions 3) ATTACH all partition indexes to the parent; the index auto-marks as valid Increasing column length won\u0026rsquo;t rebuild indexes, EXCEPT for partitioned tables where it WILL rebuild indexes PostgreSQL分区表\n33. Soft Parsing vs Hard Parsing Concepts # Hard parsing: For a SQL statement, the optimizer must first perform lexical and syntax analysis, converting it into a query tree PG can understand, then rewrite and optimize it, generating an execution plan tree before the executor can execute. This full parsing process is called hard parsing.\nSoft parsing: Obviously, performing such complex steps for every statement each time would be very inefficient. So PG caches SQL execution plans in process memory. When certain conditions are met, cached plans can be used directly, improving efficiency. 
This is soft parsing.\nPG bind-variable SQL parsing: the five-time mechanism:\nThe five-time mechanism prevents data skew from causing inefficient execution plans.\nFirst 5 executions: each generates an execution plan based on the actual bound values (a custom plan); this is hard parsing. 6th execution: PG builds a generic plan and compares its estimated cost with the average cost of the custom plans generated so far (the custom-plan cost includes a charge for planning).\nIf the generic plan is cheaper than that average: the generic plan is cached and reused, so the plan no longer changes with the parameter values; this is soft parsing. Otherwise: PG keeps generating a custom plan for each execution, which means hard parsing every time. Forcing soft/hard parsing:\nPG 12 introduced the plan_cache_mode parameter with options:\nauto: default, uses the five-time mechanism force_custom_plan: always hard parse; suitable for SQL with data skew where performance and stability are critical force_generic_plan: always use generic plan; suitable for SQL without data skew or where performance/stability requirements are lower PG 14 added generic_plans and custom_plans columns to pg_prepared_statements, showing counts for both plan types. Since PG execution plans are only cached in-process, pg_prepared_statements only shows the current session\u0026rsquo;s SQL, not other sessions or global info.\nFive-time mechanism source code:\n/* * choose_custom_plan: choose whether to use custom or generic plan * * This defines the policy followed by GetCachedPlan. */ static bool choose_custom_plan(CachedPlanSource *plansource, ParamListInfo boundParams) { double\tavg_custom_cost; ...
/* Let settings force the decision */ if (plan_cache_mode == PLAN_CACHE_MODE_FORCE_GENERIC_PLAN) return false; if (plan_cache_mode == PLAN_CACHE_MODE_FORCE_CUSTOM_PLAN) return true; /* See if caller wants to force the decision */ if (plansource-\u0026gt;cursor_options \u0026amp; CURSOR_OPT_GENERIC_PLAN) return false; if (plansource-\u0026gt;cursor_options \u0026amp; CURSOR_OPT_CUSTOM_PLAN) return true; /* Generate custom plans until we have done at least 5 (arbitrary) */ if (plansource-\u0026gt;num_custom_plans \u0026lt; 5) return true; avg_custom_cost = plansource-\u0026gt;total_custom_cost / plansource-\u0026gt;num_custom_plans; /* * Prefer generic plan if it\u0026#39;s less expensive than the average custom * plan. (Because we include a charge for cost of planning in the * custom-plan costs, this means the generic plan only has to be less * expensive than the execution cost plus replan cost of the custom * plans.) * * Note that if generic_cost is -1 (indicating we\u0026#39;ve not yet determined * the generic plan cost), we\u0026#39;ll always prefer generic at this point. */ if (plansource-\u0026gt;generic_cost \u0026lt; avg_custom_cost) return false; return true; } Hehuyi_In 软硬解析的概念\n34. What Are VM / FSM / INIT Files # Numeric suffix: Files fork when exceeding 1GB (default); changeable at build time via ./configure --with-segsize\nVM: Visibility map, containing all-visible and all-frozen info. Helps: 1) accelerate vacuum scanning (skip all-visible pages) 2) accelerate eager freeze (skip all-frozen pages) 3) support INDEX ONLY SCAN (all-visible pages don\u0026rsquo;t need page access for tuple visibility checks)\nFSM: Free space map, helping PG locate free space on pages. For index pages, since indexes are ordered, recording per-page free space is less meaningful; index FSM files only contain fully empty index pages.\nINIT: A fork file only for unlogged tables, size 0, marking the data file as unlogged.\n《postgresql-internals-14》\n35. 
Memory Reclaim Mechanisms: kswapd / Direct Memory Reclaim / pdflush # Memory reclaim mechanisms:\nBackground memory reclaim (kswapd): When physical memory is tight, the kswapd kernel thread is woken to reclaim memory asynchronously, not blocking process execution. Direct memory reclaim: If background async reclaim can\u0026rsquo;t keep up with process memory allocation requests, direct reclaim begins — synchronous, blocking process execution.\n（https://vivani.net/2022/06/14/linux-kernel-tuning-page-allocation-failure/)\npages_low: When available free pages drop below pages_low, buddy allocator wakes kswapd; kernel begins swapping pages to disk. pages_min: When available pages reach pages_min, page reclaim pressure is high because the memory zone urgently needs free pages. Allocator performs kswapd work synchronously — sometimes called direct reclaim. pages_high: Once kswapd is woken and releasing pages, only when available pages reach pages_high does the kernel consider the zone \u0026ldquo;balanced\u0026rdquo;. At pages_high, kswapd re-enters sleep. Free pages above pages_high mean the zone state is ideal. Memory reclaim operates per-zone; /proc/zoneinfo shows min, low, high values.\nvm.min_free_kbytes (the min_pages line) is a critically important OS parameter. Very low values prevent effective system memory reclamation, potentially causing crashes and service interruptions. Excessively high values increase reclaim activity, causing allocation latency and potentially immediate out-of-memory states.\npdflush and kcompactd:\npdflush: pagecache dirty pages must be written to disk. Whether via sync (fsync etc.), OS-scheduled flushing, or database commits, ultimately the Linux kernel thread pdflush handles the flushing work.\nkcompactd: page compaction specifically targets memory fragmentation cleanup (flushing also works since memory returns to the buddy system). 
Unlike pdflush flushing, memory compaction doesn\u0026rsquo;t require disk writes.\nObserving memory reclaim:\nsar is one of the most comprehensive Linux system performance analysis tools, reporting on multiple dimensions: file read/write, syscall usage, disk I/O, CPU efficiency, memory usage, process activity, and IPC.\nsar -B observes kswapd and direct memory reclaim:\nExample: sar viewing memory page status sar -B 1 3\npgpgin/s: KB read from disk/SWAP into memory per second pgpgout/s: KB written from memory to disk/SWAP per second fault/s: page faults per second (major + minor) majflt/s: major page faults per second pgfree/s: pages placed in free queue per second pgscank/s: pages scanned by kswapd per second pgscand/s: pages directly scanned per second pgsteal/s: pages cleared from cache per second to meet memory needs %vmeff: percentage of stolen pages (pgsteal) vs total scanned (pgscank + pgscand) Example: sar viewing historical memory info sar -B -s \u0026quot;08:00:00\u0026quot; -e \u0026quot;10:00:00\u0026quot;\n# Without -e means from start time to now $ sar -B -s \u0026#34;08:00:00\u0026#34; 09:45:01 PM pgpgin/s pgpgout/s fault/s majflt/s pgfree/s pgscank/s pgscand/s pgsteal/s %vmeff 09:46:01 PM 414429.37 395024.08 179478.63 0.07 352922.62 12003.78 4266.52 16269.42 99.99 09:47:01 PM 879907.08 337948.43 157970.97 0.02 402290.21 0.00 0.00 0.00 0.00 09:48:01 PM 772977.43 507343.30 150255.50 0.05 466742.08 0.00 5821.28 5821.27 100.00 Strong recommendation: linux内存浅析\n36. Process Scheduling, D Process Hazards and Causes # Not fully understanding what \u0026ldquo;process scheduling\u0026rdquo; specifically refers to here; I\u0026rsquo;ll answer in terms of IPC (Inter-Process Communication).\nIPC:\nSince user space in virtual address space can\u0026rsquo;t be accessed by other user processes, achieving multi-process user access to the same memory data via kernel space inevitably involves context switching (as shown on the right below). 
Multi-process applications clearly need inter-process access, so a method enabling user processes to directly access the same physical memory emerged: shared memory (as shown on the left below).\nShared memory is one IPC (Inter-Process Communication) mechanism; others include message queues and semaphores. Shared memory is one of the fastest IPC mechanisms because it doesn\u0026rsquo;t require inter-process data copying — processes access shared memory through their own address spaces.\n（https://www.geeksforgeeks.org/inter-process-communication-ipc/）\nShared memory has many implementations. In PG, shared_buffer defaults to mmap for shared memory (corresponds to shared_memory_type); parallel queries default to POSIX (corresponds to dynamic_shared_memory_type).\n(https://momjian.us/main/writings/pgsql/inside_shmem.pdf)\nD Process:\nD process meaning: Uninterruptible sleep state. Indicates the process is waiting for an external event to complete, such as disk I/O or network requests. Normally, D processes cannot be directly terminated.\nCauses of D processes: The process is waiting for an external event, typically direct memory reclaim — synchronous and blocking application disk access. At that moment, disk-access-related processes are in D state. Note: D processes are triggered at the OS or hardware level, largely unrelated to the application itself (a little). For example, a PG large query session itself won\u0026rsquo;t produce D processes and can be killed.\nlinux内存浅析\nPostgreSQL内存浅析\n37. Packet Capture and Analysis of PostgreSQL Protocol # PG supported protocols:\nConnection protocols: TCP/IP: PostgreSQL\u0026rsquo;s most common communication method, allowing client-server network connections and data exchange. Unix domain socket: For same-host client-server connections, faster than TCP/IP. SSL/TLS: PostgreSQL supports SSL/TLS encryption on TCP/IP connections for data transmission security. 
TLS is SSL\u0026rsquo;s successor; PG (seemingly) no longer supports SSL protocol itself, though related parameters remain for TLS use. Password authentication protocols: MD5: As the earlier default password authentication protocol, MD5 (Message Digest Algorithm 5) stores and verifies user passwords server-side. SCRAM-SHA-256: A more secure authentication protocol using SHA-256 hashing and challenge-response for user authentication. PG10+ gradually replaces MD5. Simple packet capture analysis:\ntcpdump capture command:\ntcpdump tcp port 5432 -i lo -s0 -nSX -vvv Capture a count(*) (already connected to database via psql -h):\nlzldb=\u0026gt; select count(*) from t1; -- just capture this count ------- 4 Captured content:\n15:51:34.828820 IP (tos 0x0, ttl 64, id 29027, offset 0, flags [DF], proto TCP (6), length 82) 172.18.10.85.37240 \u0026gt; 172.18.10.85.postgres: Flags [P.], cksum 0x6d13 (incorrect -\u0026gt; 0x57c6), seq 1091052893:1091052923, ack 3014367256, win 350, options [nop,nop,TS val 92480460 ecr 92427582], length 30 0x0000: 4500 0052 7163 4000 4006 5c74 ac12 0a55 E..Rqc@.@.\\t...U 0x0010: ac12 0a55 9178 1538 4108 255d b3ab 9818 ...U.x.8A.%].... 0x0020: 8018 015e 6d13 0000 0101 080a 0583 23cc ...^m.........#. 0x0030: 0582 553e 5100 0000 1d73 656c 6563 7420 ..U\u0026gt;Q....select. 0x0040: 636f 756e 7428 2a29 2066 726f 6d20 7431 count(*).from.t1 0x0050: 3b00 ;. 15:51:34.830090 IP (tos 0x0, ttl 64, id 49370, offset 0, flags [DF], proto TCP (6), length 115) 172.18.10.85.postgres \u0026gt; 172.18.10.85.37240: Flags [P.], cksum 0x6d34 (incorrect -\u0026gt; 0x6e5c), seq 3014367256:3014367319, ack 1091052923, win 342, options [nop,nop,TS val 92480461 ecr 92480460], length 63 0x0000: 4500 0073 c0da 4000 4006 0cdc ac12 0a55 E..s..@.@......U 0x0010: ac12 0a55 1538 9178 b3ab 9818 4108 257b ...U.8.x....A.%{ 0x0020: 8018 0156 6d34 0000 0101 080a 0583 23cd ...Vm4........#. 
0x0030: 0583 23cc 5400 0000 1e00 0163 6f75 6e74 ..#.T......count 0x0040: 0000 0000 0000 0000 0000 1400 08ff ffff ................ 0x0050: ff00 0044 0000 000b 0001 0000 0001 3443 ...D..........4C 0x0060: 0000 000d 5345 4c45 4354 2031 005a 0000 ....SELECT.1.Z.. 0x0070: 0005 49 ..I 15:51:34.830098 IP (tos 0x0, ttl 64, id 29028, offset 0, flags [DF], proto TCP (6), length 52) 172.18.10.85.37240 \u0026gt; 172.18.10.85.postgres: Flags [.], cksum 0x6cf5 (incorrect -\u0026gt; 0x5cb9), seq 1091052923, ack 3014367319, win 350, options [nop,nop,TS val 92480461 ecr 92480461], length 0 0x0000: 4500 0034 7164 4000 4006 5c91 ac12 0a55 E..4qd@.@.\\....U 0x0010: ac12 0a55 9178 1538 4108 257b b3ab 9857 ...U.x.8A.%{...W 0x0020: 8010 015e 6cf5 0000 0101 080a 0583 23cd ...^l.........#. 0x0030: 0583 23cd ..#. Reading packets visually\u0026hellip; simple analysis shows this count statement only generated 3 packets, and you can even see the select.count(*).from.t1 statement.\nWireshark packet analysis:\nWindow 1:\ntcpdump tcp port 5432 -i lo -s0 -nSX -vvv -w tcpdump.cap Window 2:\n[postgres@iZ2vcdugd3f2h0t7x20pqmZ data]$ psql -h 172.18.10.85 -p 5432 -d lzldb -U lzl -- step 1, connect Password for user lzl: -- step 2, enter password lzldb=\u0026gt; select count(*) from t1; -- step 3, query count ------- 4 lzldb=\u0026gt; \\q -- step 4, exit Note 4 steps, corresponding to at least 4 packet sections:\nStep 1 - connection request Step 2 - password entry Step 3 - SQL query Step 4 - disconnect Now analyze tcpdump.cap with Wireshark.\nStep 1 - Connection Request [1-10] — TCP three-way handshake [1-3]: 37282-\u0026gt;5432 sends SYN, seq=0 5432-\u0026gt;37282 sends SYN+ACK, seq=0 ack=1 37282-\u0026gt;5432 sends ACK, seq=1 ack=1 （https://www.researchgate.net/publication/340247809_Computer_Network_Chapter_8_Transport_Layer_UDP_and_TCP）\nStep 1 - Connection Request [1-10] — PGSQL protocol startup and authentication request [4-7]: After the three-way handshake, PSQL client immediately sends a PGSQL 
protocol startup message to PG server [4], info: \u0026gt;?, the protocol startup message.\nThe above \u0026gt;? packet is 37282-\u0026gt;5432. You don\u0026rsquo;t need to check source/destination in Transmission Control Protocol. PGSQL protocol shows even less info than TCP, but it has direction: \u0026gt; means 37282-\u0026gt;5432, \u0026lt; means 37282\u0026lt;-5432.\nNext PGSQL protocol message is authentication request [6], info: \u0026lt;R, 37282\u0026lt;-5432.\nStep 1 - Connection Request [1-10] — Three-way FIN [8-10]. After server sends PGSQL authentication request to client, client requests TCP disconnect, 3 TCP FINs (not 4; explained below). Note: at this point psql command line is waiting for password input\u0026hellip; Step 2 - Password Entry [11-22] — Three-way handshake [11-13]. Because the first TCP connection ended, establishing a connection again starts from TCP\u0026hellip; so another three-way handshake: Step 2 - Password Entry [11-22] — Password authentication [14-22]. Authentication phase is slightly more complex. [14-16] essentially does the same as [4-7] in step 1: client requests PGSQL protocol startup, server returns authentication request. Then [18-20] performs password authentication using SCRAM-SHA-256 mechanism; password authentication actually transmits 4 packets, including [21]\u0026rsquo;s two R authentication messages. Then [21] connection established: first two R\u0026rsquo;s are authentication complete; many S\u0026rsquo;s represent Parameter status: application name, charset, timezone, etc.; K represents Backend key, returning forked backend PID; Z represents ready for query. 
Step 3 - SQL Query [23-25] [23] Q clearly represents Query, client sends packet containing SQL; [24] returns results: T represents Row Description (here only column name \u0026ldquo;count\u0026rdquo;); D represents data row, the count result is 4, data is plaintext unencrypted:\nC represents Command complete; Z represents ready.\nStep 4 - Disconnect [26-29]. [26] client actively sends session end message, PGSQL protocol (corresponds to \\q); [27-29] again 3 TCP FINs. Why three FINs instead of four?\n\u0026ldquo;No more data to send\u0026rdquo; AND \u0026ldquo;TCP delayed ACK mechanism enabled\u0026rdquo; means the second and third FINs merge, resulting in three FINs:\n（TCP 四次挥手，可以变成三次吗？）\nSince TCP delayed ACK is enabled by default, three-FIN scenarios appear more often than four-FIN in captures.\nOK, simple PG packet capture and analysis complete. Summary network transmission diagram for this session: Packet capture analysis notes:\nFirst understand the link; typically many nodes exist between application clients and database servers: network switches, request forwarding services, etc. Capture on both ends simultaneously when possible Pay attention to capture timing and set appropriate filters Possible packet loss points:\nhttps://mp.weixin.qq.com/s/dF4juaW-ttI0Zn1j0z6tag\nPacket loss involves NICs, drivers, and kernel protocol stack — each layer can lose packets:\nBetween two VM connections, transmission failures may occur: network congestion, line errors, etc. After NIC receives packets, the ring buffer may overflow and drop packets At IP layer: routing failures, packet size exceeding MTU, etc. At transport layer: port not listening, resource usage exceeding kernel limits, etc. At socket layer: socket buffer overflow and packet loss At application layer: application exceptions causing packet loss References:\nTcpdump一次抓包记录（Postgresql通信）\n学徒 DBA必备技能之网络丢包分析总结\nPgSQL协议分析:网络抓包\nTCP 四次挥手，可以变成三次吗？\n38. Storage: SAN / NAS / DAS # 39. 
Lifecycle of an IO Request # （https://blog.csdn.net/Hehuyi_In/article/details/100715177?spm=1001.2014.3001.5501）\n","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/2024/08/12/postgresql-interview-questions-comprehensive-collection/","section":"Posts","summary":"Interview questions source: PostgreSQL Apprentice PostgreSQL Interview Questions Collection\nExisting answers: Hehuyi_In Learning and Answering PostgreSQL Interview Questions\n1. MVCC Implementation and Differences from Oracle # ORACLE and MYSQL both use UNDO to implement multi-version concurrency control. Undo entries are recorded in additional undo tablespaces. If the UNDO segment is insufficient, an ora-01555 error occurs. https://www.slideshare.net/AmitBhalla2/less10-undo-15946188\nPostgreSQL has no undo mechanism. To ensure transaction rollback, old tuples remain on the table. For example, an update inserts a new row while the old data stays in place. Tuple headers, clog, etc. determine which tuple version is valid. Visibility information in tuple headers includes xmin, xmax, cmin, cmax, infomask, and infomask2, stored in the tuple header.\n","title":"PostgreSQL Interview Questions - Comprehensive Collection","type":"posts"},{"content":" Localization Concepts # The purpose of localization is to support the language features and rules of different countries and regions. With localization support, you can use character sets that handle Chinese, French, Japanese, and more. Beyond character sets, there are also character sorting rules and other language-related rule support. For example, we know how to sort (\u0026lsquo;a\u0026rsquo;, \u0026lsquo;b\u0026rsquo;), but how should (\u0026lsquo;a\u0026rsquo;, \u0026lsquo;A\u0026rsquo;) and (\u0026lsquo;啊\u0026rsquo;, \u0026lsquo;阿\u0026rsquo;) be sorted?\nIf you search Google for information about localization, character sets, and collation, you might end up with knowledge that feels both complex and distant. 
The best teacher is still Localization knowledge is divided into three parts: locale support, collation, and character sets.\nlocale # PostgreSQL\u0026rsquo;s localization is provided by the operating system. You need to check whether the OS supports it via locale -a. The locale can be specified when initializing the database:\ninitdb --locale=en_US You can also set localization subcategories individually: string sort order, character classification, numeric formatting, date formatting, time formatting, currency formatting, etc.\ninitdb --locale=zh_CN --lc-monetary=en_US All localization subcategories:\nSubcategory Rule LC_COLLATE String sort order LC_CTYPE Character classification (What is a letter? Its upper-case equivalent?) LC_MESSAGES Language of messages LC_MONETARY Formatting of currency amounts LC_NUMERIC Formatting of numbers LC_TIME Formatting of dates and times These subcategories can be split into two groups. lc_messages, lc_monetary, lc_numeric, and lc_time can be adjusted via parameters after initialization. LC_COLLATE and LC_CTYPE belong to collation — see the collation section for adjustment details.\nLocale settings affect the following behaviors:\nSort order in queries using ORDER BY or the standard comparison operators on textual data The upper, lower, and initcap functions Pattern matching operators (LIKE, SIMILAR TO, and POSIX-style regular expressions); locales affect both case insensitive matching and the classification of characters by character-class regular expressions The to_char family of functions The ability to use indexes with LIKE clauses COLLATION # Collation defines the sort order of characters and character classification behavior. 
Some database operators depend on collation, such as ORDER BY, lower, upper, initcap, to_char, and others.\nUse the following SQL to query the system table pg_collation to get LC_COLLATE and LC_CTYPE information for supported character sets:\nselect pg_encoding_to_char(collencoding) as encoding,collname,collcollate,collctype from pg_collation where collname in (\u0026#39;default\u0026#39;,\u0026#39;C\u0026#39;,\u0026#39;POSIX\u0026#39;,\u0026#39;en_US.utf8\u0026#39;,\u0026#39;zh_CN.utf8\u0026#39;,\u0026#39;zh_CN.gb2312\u0026#39;,\u0026#39;zh_SG.gb2312\u0026#39;) ; encoding | collname | collcollate | collctype ----------+--------------+--------------+-------------- | default | | | C | C | C | POSIX | POSIX | POSIX UTF8 | en_US.utf8 | en_US.utf8 | en_US.utf8 EUC_CN | zh_CN.gb2312 | zh_CN.gb2312 | zh_CN.gb2312 UTF8 | zh_CN.utf8 | zh_CN.utf8 | zh_CN.utf8 EUC_CN | zh_SG.gb2312 | zh_SG.gb2312 | zh_SG.gb2312 encoding is the character set, and collname is the collation name.\nWhen encoding is empty, it means this collation supports all character sets. default, C, POSIX are collations supported on all platforms, provided by libc. Other collations depend on whether the operating system supports them (locale -a). default means using the collation set at database creation time, which can be viewed via \\l. C is semantically equivalent to POSIX, but PostgreSQL still considers them different collations. They both compare characters by ASCII code, strictly by byte order. 
=\u0026gt; SELECT \u0026#39;a\u0026#39; COLLATE \u0026#34;C\u0026#34; \u0026lt; \u0026#39;b\u0026#39; COLLATE \u0026#34;POSIX\u0026#34; ; ERROR: 42P21: collation mismatch between explicit collations \u0026#34;C\u0026#34; and \u0026#34;POSIX\u0026#34; LINE 1: SELECT \u0026#39;a\u0026#39; COLLATE \u0026#34;C\u0026#34; \u0026lt; \u0026#39;b\u0026#39; COLLATE \u0026#34;POSIX\u0026#34; ; LOCATION: merge_collation_state, parse_collate.c:834 UTF8 is the most common character set, and the most common language environments are en_US and zh_CN. You can create custom collations via CREATE COLLATION .... However, cases where LC_COLLATE and LC_CTYPE differ are very rare. LC_COLLATE # LC_COLLATE affects character comparison (sorting, character operations, etc.).\nThe COLLATE clause can transform the collation of an expression:\nexpr COLLATE collation Note that this specifies a collation, not lc_collate. If no collation is explicitly specified, the database uses the column\u0026rsquo;s collation by default. 
If the column has no collation specified, it uses the database\u0026rsquo;s default collation.\nSorting test with different collations:\nselect col1 from (values (\u0026#39;a\u0026#39;), (\u0026#39;A\u0026#39;), (\u0026#39;啊\u0026#39;), (\u0026#39;阿\u0026#39;)) -\u0026gt; AS l(col1) -\u0026gt; order by col1 collate \u0026#34;C\u0026#34;; col1 ------ A a 啊 阿 select col1 from (values (\u0026#39;a\u0026#39;), (\u0026#39;A\u0026#39;), (\u0026#39;啊\u0026#39;), (\u0026#39;阿\u0026#39;)) -\u0026gt; AS l(col1) -\u0026gt; order by col1 collate \u0026#34;en_US.utf8\u0026#34;; col1 ------ a A 啊 阿 select col1 from (values (\u0026#39;a\u0026#39;), (\u0026#39;A\u0026#39;), (\u0026#39;啊\u0026#39;), (\u0026#39;阿\u0026#39;)) -\u0026gt; AS l(col1) -\u0026gt; order by col1 collate \u0026#34;zh_CN.utf8\u0026#34;; col1 ------ a A 阿 啊 These three different collations have different lc_collate values, and the sort methods are indeed different — we can see three distinct sort results from the output.\nWhy does collation C put \u0026lsquo;A\u0026rsquo; before \u0026lsquo;a\u0026rsquo;? Collation C uses ASCII encoding order. In ASCII, uppercase letters come before lowercase. Meanwhile, en_US.utf8 and zh_CN.utf8 clearly do not follow this order for English letters.\nOrder of Chinese characters Even with the same UTF8 character set, the order of Chinese characters differs between Chinese and English locales. Different lc_collate values correspond to different alphabets for different localized languages. The sort order with lc_collate=C is always by byte order. 
Although ASCII does not include Chinese, C can still sort Chinese: in a UTF8 database every Chinese character is stored as a UTF8 byte sequence, and C compares those bytes.\nLC_CTYPE # LC_CTYPE affects character operations (such as upper, initcap, etc.).\nIf the string is all English, e.g., 'abcD', initcap converts it to 'Abcd' under all three collations, so there is nothing special to show here.\nBut when Chinese is introduced, the results differ:\nselect initcap(\u0026#39;啊aAAa阿bBBb\u0026#39; collate \u0026#34;C\u0026#34;); initcap -------------- 啊Aaaa阿Bbbb select initcap(\u0026#39;啊aAAa阿aAAa\u0026#39; collate \u0026#34;en_US.utf8\u0026#34;); initcap -------------- 啊aaaa阿aaaa select initcap(\u0026#39;啊aAAa阿aAAa\u0026#39; collate \u0026#34;zh_CN.utf8\u0026#34;); initcap -------------- 啊aaaa阿aaaa When LC_CTYPE=C, initcap capitalizes the first letter of each run of English letters, so the Chinese characters act as word separators. Under en_US.utf8 and zh_CN.utf8 the whole string is treated as a single word: only the very first character counts as the word start (it is Chinese and remains unchanged), and all English letters are lowercased.\nThe behavior of initcap with Chinese may not be formally defined, but we can conclude: different LC_CTYPE settings lead to different results from character-sensitive functions like initcap.\nFurthermore, Chinese has no letter case, but some other languages do, so different LC_CTYPE settings can lead to even more complex outcomes.\nCharacter Sets # Character Set Basics # PostgreSQL supports different character sets (also called encodings). Character sets and collation are two separate concepts, but the character set must be compatible with LC_CTYPE and LC_COLLATE. 
As seen in pg_collation, C/POSIX support all character sets, while other collations only support one character set (on Linux systems).\nChinese-related character sets available in PostgreSQL: *(The C collation is provided by the libc library; some collations can be provided by the ICU library, requiring compilation in advance.)\nName Description Language Server-side support? ICU support? Bytes/Char Aliases BIG5 Big Five Traditional Chinese No No 1–2 WIN950, Windows950 EUC_CN Extended UNIX Code-CN Simplified Chinese Yes Yes 1–3 GB2312 GB18030 National Standard Chinese No No 1–4 GBK Extended National Standard Simplified Chinese No No 1–2 WIN936, Windows936 UTF8 Unicode, 8-bit all Yes Yes 1–4 Unicode Traditional Chinese: BIG5 is the most common character set standard for Traditional Chinese. It was once the industry standard and was later incorporated as a national standard.\nSimplified Chinese: GB stands for \u0026ldquo;Guobiao\u0026rdquo; (national standard). GB2312, GB18030, and GBK are all Chinese national character set standards. Due to issues such as rare characters and years of development producing several historical versions, there appear to be multiple standards. EUC_CN stands for Extended UNIX Code-CN, which is essentially GB2312, but it cannot handle all rare characters either. Similarly named encodings include EUC_KR, EUC_JP, EUC_TW, and so on.\nInternational Standards: The character sets above are all national standards — they support English and Chinese but not other languages. The international standard that supports all languages of the world is Unicode (which even includes emoji \u0026#x1f44d;). 
(There is also the well-known international standards organization ISO, which maintains character sets as well — there is some overlap, but we\u0026rsquo;ll set ISO aside for now.)\nDue to different Unicode encoding schemes, there are three encoding formats: UTF-8, UTF-16, and UTF-32.\nUTF-8 encoding format:\nBytes Format Actual encoding bits Code point range 1 byte 0xxxxxxx 7 0 ~ 127 2 byte 110xxxxx 10xxxxxx 11 128 ~ 2047 3 byte 1110xxxx 10xxxxxx 10xxxxxx 16 2048 ~ 65535 4 byte 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx 21 65536 ~ 2097151 UTF8 encoding is variable-length. For characters in the range 0x00-0x7F (1 byte), UTF-8 encoding is exactly identical to ASCII (American Standard Code for Information Interchange). Therefore, UTF-8 is fully backward-compatible with ASCII.\nDue to shared origins, meanings, and similarities, Chinese, Japanese, Korean, and Vietnamese characters use a unified encoding in Unicode called CJK Unified Ideographs (CJKV Unified Ideographs). CJK Unified Ideographs encoding ranges: 3400-4DBF/4E00-9FFF/20000-3FFFF.\nCharacter Set Conversion # When server_encoding and client_encoding differ, automatic conversion of the character set returned by the server can occur. For setting server-side and client-side character sets, see the \u0026ldquo;Configuring Character Sets\u0026rdquo; section.\nChinese-related character sets — Server/Client convertible table:\nServer Character Set Available Client Character Sets BIG5 not supported as a server encoding EUC_CN (GB2312) EUC_CN (GB2312), MULE_INTERNAL, UTF8 GB18030 not supported as a server encoding GBK not supported as a server encoding UTF8 all supported encodings GB18030 and GBK are not supported on the server side, so in practice only EUC_CN (GB2312) and UTF8 can perform Server/Client conversion. The above lists the character sets that can be converted, but conversion still requires CONVERSION support. 
PostgreSQL has built-in conversion functions visible via pg_conversion:\nConversion Name Source Encoding Destination Encoding big5_to_utf8 BIG5 UTF8 euc_cn_to_utf8 EUC_CN UTF8 gb18030_to_utf8 GB18030 UTF8 gbk_to_utf8 GBK UTF8 utf8_to_big5 UTF8 BIG5 utf8_to_euc_cn UTF8 EUC_CN utf8_to_gb18030 UTF8 GB18030 utf8_to_gbk UTF8 GBK You can create custom conversions via the CREATE CONVERSION statement, specifying the conversion function.\nSome character sets appear to be interconvertible, but the server side doesn\u0026rsquo;t support storing them at all (such as BIG5, GB18030, GBK), so it\u0026rsquo;s not practically useful. All we need to know here is that euc_cn and utf8 can be converted to/from each other.\nWithout CONVERSION support, conversion cannot happen:\n-- EUC_CN database =\u0026gt; \\encoding EUC_KR EUC_KR: invalid encoding name or conversion procedure not found Character set conversion test: Pay attention to the client-side character set settings (e.g., CRT\u0026rsquo;s \u0026ldquo;session\u0026rdquo; - \u0026ldquo;Appearance\u0026rdquo; - \u0026ldquo;Character encoding\u0026rdquo;)\nThere are at least three endpoints with character set concepts: database server, database client, and UI client. CONVERSION only controls: database server → database client.\nServer with UTF8 conversion test: create table zh(col1 varchar(20)); insert into zh values(\u0026#39;\u0026gt;\u0026#39;),(\u0026#39;阿\u0026#39;),(\u0026#39;〇\u0026#39;); -- 〇 (líng) is a Chinese character -- If CRT is not set to UTF8, Chinese characters are all garbled; only set CRT to UTF8 for insertion =\u0026gt; show server_encoding; server_encoding ----------------- UTF8 =\u0026gt; show client_encoding; client_encoding ----------------- UTF8 -- With no conversion at all, UTF8 displays correctly. Currently three endpoints: UTF8 - UTF8 - UTF8 =\u0026gt; select * from zh; col1 ------ \u0026gt; 阿 〇 -- Switch database client character set. 
Now three endpoints: UTF8 - EUC_CN - UTF8 =\u0026gt; \\encoding EUC_CN; -- Set client character set =\u0026gt; select * from zh where col1 in (\u0026#39;阿\u0026#39;); ERROR: 22021: invalid byte sequence for encoding \u0026#34;EUC_CN\u0026#34;: 0xe9 0x98 LOCATION: report_invalid_encoding, mbutils.c:1597 Time: 0.112 ms =\u0026gt; select * from zh where col1 in (\u0026#39;〇\u0026#39;); ERROR: 22021: invalid byte sequence for encoding \u0026#34;EUC_CN\u0026#34;: 0xe3 0x80 ERROR: 22021: invalid byte sequence for encoding \u0026#34;EUC_CN\u0026#34;: 0xe3 0x80 -- It looks like \u0026#34;阿\u0026#34; and \u0026#34;〇\u0026#34; cannot be converted to EUC_CN, but that\u0026#39;s not the whole story =\u0026gt; select * from zh limit 2; col1 ------ \u0026gt; \u0026lt;B0\u0026gt;\u0026lt;A2\u0026gt; (2 rows) -- The second row is \u0026#34;阿\u0026#34;. The database server/client appears to have converted the character set from UTF8 to EUC_CN. -- However, it may not display correctly due to UI client issues (currently CRT is set to UTF8) -- Even changing CRT to GB2312 still won\u0026#39;t display correctly select * from zh limit 2; col1 ------ \u0026gt; \u0026lt;B0\u0026gt;\u0026lt;A2\u0026gt; (2 rows) -- When querying 〇, the database throws an error directly, indicating 〇 cannot be converted from UTF8 to EUC_CN select * from zh ; ERROR: 22P05: character with byte sequence 0xe3 0x80 0x87 in encoding \u0026#34;UTF8\u0026#34; has no equivalent in encoding \u0026#34;EUC_CN\u0026#34; LOCATION: report_untranslatable_char, mbutils.c:1631 Server with EUC_CN conversion test: =\u0026gt; show server_encoding; -- Database has EUC_CN character set server_encoding ----------------- EUC_CN -- Create the same zh table under the EUC_CN database, but inserting already has issues =\u0026gt; insert into zh values(\u0026#39;〇\u0026#39;); ERROR: 22P05: character with byte sequence 0xe3 0x80 0x87 in encoding \u0026#34;UTF8\u0026#34; has no equivalent in encoding \u0026#34;EUC_CN\u0026#34; LOCATION: 
report_untranslatable_char, mbutils.c:1631 Again, the error says 〇 cannot be converted from UTF8 to EUC_CN. The EUC_CN (GB2312) repertoire is not identical to UTF8\u0026rsquo;s: EUC_CN (GB2312) does not include all Chinese characters, especially rare ones.\nConfiguring locale, collation, and character set # Now that we\u0026rsquo;ve covered localization and character sets, here\u0026rsquo;s a summary.\nDatabase cluster locale, collation, character set # At initialization time, you can set the database cluster\u0026rsquo;s locale and character set:\ninitdb -D $DATADIR -E UTF8 --locale=en_US.UTF8 initdb -D $DATADIR -E UTF8 --locale=en_US.UTF8 --lc-collate=C --lc-ctype=C initdb -D $DATADIR -E UTF8 --locale=en_US.UTF8 --lc-collate=C --lc-ctype=C --lc-messages=en_US.UTF8 --lc-monetary=en_US.UTF8 --lc-numeric=en_US.UTF8 --lc-time=en_US.UTF8 initdb creates three databases: postgres, template1, and template0. The CREATE DATABASE statement defaults to using template1 to create databases.\nencoding sets the character set; locale sets LC_COLLATE, LC_CTYPE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, and LC_TIME, unless specifically overridden (e.g., via --lc-collate).\nLC_COLLATE and LC_CTYPE together make up the collation and can also be set at the database, column, and index levels. LC_MESSAGES, LC_MONETARY, LC_NUMERIC, and LC_TIME are instance parameters that can be changed at any time.\nencoding can only be set at initialization or at database creation; once set, it cannot be changed.\nDatabase collation and character set # When creating a database, you can set the database\u0026rsquo;s character set, lc_collate, and lc_ctype.\nBoth CREATE DATABASE and createdb can specify the character set at database creation time. Once created, the database character set cannot be changed. Both commands use a template database to create the new database.\nThere are two templates: template0 and template1. 
The official documentation states:\nAnother common reason for copying template0 instead of template1 is that new encoding and locale settings can be specified when copying template0, whereas a copy of template1 must use the same settings it does. This is because template1 might contain encoding-specific or locale-specific data, while template0 is known not to.\ntemplate1 is a writable template database that may contain localized data, while template0 cannot be written to. Therefore, to create a database with different localization settings, you should use template0.\nAnd you must explicitly use template0, because the default is template1. Attempting to create a database with a different character set without specifying template0 will result in an error:\n=\u0026gt; create database db_GB2312 ENCODING \u0026#39;EUC_CN\u0026#39; LC_COLLATE \u0026#39;zh_CN.gb2312\u0026#39; LC_CTYPE \u0026#39;zh_CN.gb2312\u0026#39;; ERROR: 22023: new encoding (EUC_CN) is incompatible with the encoding of the template database (UTF8) HINT: Use the same encoding as in the template database, or use template0 as template. Additionally, you cannot set the character set by specifying locale when creating a database:\n=\u0026gt; create database db_GB2312 locale \u0026#39;zh_CN.gb2312\u0026#39; template \u0026#39;template0\u0026#39;; ERROR: 22023: encoding \u0026#34;UTF8\u0026#34; does not match locale \u0026#34;zh_CN.gb2312\u0026#34; DETAIL: The chosen LC_CTYPE setting requires encoding \u0026#34;EUC_CN\u0026#34;. LOCATION: check_encoding_locale_matches, dbcommands.c:773 The error shows that the LC_CTYPE implied by the locale requires the EUC_CN encoding, which was not specified, so the default UTF8 was used. Adding the collation-related sub-options alongside LOCALE still produces an error:\n=\u0026gt; create database db_GB2312 LOCALE \u0026#39;EUC_CN\u0026#39; LC_COLLATE \u0026#39;zh_CN.gb2312\u0026#39; LC_CTYPE \u0026#39;zh_CN.gb2312\u0026#39;; ERROR: 42601: conflicting or redundant options DETAIL: LOCALE cannot be specified together with LC_COLLATE or LC_CTYPE. 
LOCALE cannot be used together with LC_CTYPE and other sub-options.\nRemoving locale and setting via character set, LC_COLLATE, and LC_CTYPE works successfully.\nThe correct way to create a database with a specific character set:\nCREATE DATABASE: create database db_GB2312 ENCODING \u0026#39;EUC_CN\u0026#39; LC_COLLATE \u0026#39;zh_CN.gb2312\u0026#39; LC_CTYPE \u0026#39;zh_CN.gb2312\u0026#39; template \u0026#39;template0\u0026#39;; createdb: Use the CLI command createdb, which wraps CREATE DATABASE — they are equivalent: createdb -E EUC_CN -T template0 --lc-collate=zh_CN.gb2312 --lc-ctype=zh_CN.gb2312 db_GB2312 Viewing database character set:\n\\l\npg_database\nselect datname,pg_encoding_to_char(encoding),datcollate,datctype,datlocprovider,daticulocale from pg_database; SHOW parameters SERVER_ENCODING, LC_COLLATE, and LC_CTYPE are all immutable parameters that display the current database\u0026rsquo;s server-side character set, LC_COLLATE, and LC_CTYPE, respectively.\nColumn collation # Collation is only related to character sorting and character functions — it is not related to encoding. Without indexes, changing a column\u0026rsquo;s collation is essentially just adjusting the default sort output for that column. With indexes, it will rebuild the index. If no collation is specified for a column, it defaults to the database\u0026rsquo;s collation.\nSpecifying collation when creating a table (note: some data types are un-collatable, such as int):\ncreate table t1(col1 varchar(10) collate \u0026#34;en_US.utf8\u0026#34;); alter table t1 alter column col1 type varchar(10) collate \u0026#34;C\u0026#34;; Note: ALTER TABLE without changing the length will not rewrite the table, but it will definitely rebuild the index.\nViewing a column\u0026rsquo;s default collation:\n1. \\d+ t1 2. information_schema.columns select table_catalog,table_schema,table_name,column_name,collation_name from information_schema.columns where table_name=\u0026#39;t1\u0026#39;; 3. 
pg_attribute select a.attrelid::regclass,a.attname,a.attcollation,c.collname,c.collcollate,c.collctype from pg_attribute a left join pg_collation c on a.attcollation=c.oid where a.attrelid::regclass=\u0026#39;tlzl\u0026#39;::regclass and a.attcollation\u0026lt;\u0026gt;0; Method 3 is recommended. While \\d+ and information_schema.columns can show collname, collname is not unique. Only method 3 reveals collate and ctype.\nTest: specifying collate and viewing pg_attribute:\ncreate table tlzl( col1 varchar(10) , col2 varchar(10) collate \u0026#34;C\u0026#34;, col3 varchar(10) collate \u0026#34;zh_CN\u0026#34;, col4 varchar(10) collate \u0026#34;en_US.utf8\u0026#34; ); -- Column collation is like tagging the column with a default sort order; you can\u0026#39;t see the specific collate and ctype db_utf8_c=\u0026gt; Table \u0026#34;public.tlzl\u0026#34; Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description --------+-----------------------+------------+----------+---------+----------+-------------+--------------+------------- col1 | character varying(10) | | | | extended | | | col2 | character varying(10) | C | | | extended | | | col3 | character varying(10) | zh_CN | | | extended | | | col4 | character varying(10) | en_US.utf8 | | | extended | | | -- collname and collate/ctype are not one-to-one; col3\u0026#39;s zh_CN alone doesn\u0026#39;t reveal which collate is used db_utf8_c=\u0026gt; select pg_encoding_to_char(collencoding) as encoding,collname,collcollate,collctype from pg_collation where collname like \u0026#39;zh_CN%\u0026#39;; encoding | collname | collcollate | collctype ----------+--------------+--------------+-------------- EUC_CN | zh_CN | zh_CN | zh_CN EUC_CN | zh_CN.gb2312 | zh_CN.gb2312 | zh_CN.gb2312 UTF8 | zh_CN.utf8 | zh_CN.utf8 | zh_CN.utf8 UTF8 | zh_CN | zh_CN.utf8 | zh_CN.utf8 -- pg_attribute shows more precisely than \\d+ db_utf8_c=\u0026gt; select 
a.attrelid::regclass,a.attname,a.attcollation,c.collname,c.collcollate,c.collctype from pg_attribute a left join pg_collation c on a.attcollation=c.oid where a.attrelid::regclass=\u0026#39;tlzl\u0026#39;::regclass and a.attcollation\u0026lt;\u0026gt;0; attrelid | attname | attcollation | collname | collcollate | collctype ----------+---------+--------------+------------+-------------+------------ tlzl | col1 | 100 | default | | tlzl | col2 | 950 | C | C | C tlzl | col4 | 12562 | en_US.utf8 | en_US.utf8 | en_US.utf8 tlzl | col3 | 13200 | zh_CN | zh_CN.utf8 | zh_CN.utf8 -- Now we know that col3 zh_CN\u0026#39;s collate is zh_CN.utf8 Test: table rewrite when modifying column collate:\n-- Add an index to the column and check rewrite behavior db_utf8_c=\u0026gt; create index idxcol4 on tlzl(col4); CREATE INDEX db_utf8_c=\u0026gt; select pg_relation_filepath(\u0026#39;tlzl\u0026#39;) TableRelid, pg_relation_filepath(\u0026#39;idxcol4\u0026#39;) IndexRelid; tablerelid | indexrelid ------------------+------------------ base/40996/41006 | base/40996/41015 db_utf8_c=\u0026gt; alter table tlzl alter column col4 type varchar(10) collate \u0026#34;C\u0026#34;; ALTER TABLE db_utf8_c=\u0026gt; select pg_relation_filepath(\u0026#39;tlzl\u0026#39;) TableRelid, pg_relation_filepath(\u0026#39;idxcol4\u0026#39;) IndexRelid; tablerelid | indexrelid ------------------+------------------ base/40996/41006 | base/40996/41016 -- Table was not rewritten; index was rewritten A column\u0026rsquo;s collation is merely a marker. 
Modifying the column\u0026rsquo;s collation does not rewrite the table, but if there is an index on it, the index will be rewritten (sometimes not — see the next section).\nIndex collation # When creating an index, if the index\u0026rsquo;s collation is not explicitly specified, the index uses the collation declared on the column.\nExplicitly specifying collation when creating an index:\ncreate index idx_C on tlzl(col3 collate \u0026#34;C\u0026#34;); Additionally, indexes can be created with text_pattern_ops, varchar_pattern_ops, bpchar_pattern_ops — in this case, the index does not depend on collation rules but compares character by character:\nThe difference from the default operator classes is that the values are compared strictly character by character rather than according to the locale-specific collation rules.\nCREATE INDEX test_index ON test_table (col varchar_pattern_ops); In fact, this type of index is not entirely unrelated to collation — an index always has a sort order. This type of index\u0026rsquo;s sort order appears to be consistent with C. 
See the \u0026ldquo;LIKE not using index\u0026rdquo; section.\nViewing an index\u0026rsquo;s collation:\n\\d+ -- \\d+ shows indexes with explicitly specified collate; if not specified, the column\u0026#39;s default collation is used db_utf8_c=\u0026gt; \\d+ tlzl Table \u0026#34;public.tlzl\u0026#34; Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description --------+-----------------------+------------+----------+---------+----------+-------------+--------------+------------- col1 | character varying(10) | | | | extended | | | col2 | character varying(10) | C | | | extended | | | col3 | character varying(10) | zh_CN | | | extended | | | col4 | character varying(10) | en_US.utf8 | | | extended | | | Indexes: \u0026#34;idx_c\u0026#34; btree (col3 COLLATE \u0026#34;C\u0026#34;) \u0026#34;idxcol4\u0026#34; btree (col4) Access method: heap Viewing via pg_index is clearer (the indcollation type in pg_index is oidvector and cannot be directly cast to oid, making queries a bit cumbersome):\ndb_utf8_c=\u0026gt; select indcollation,indexrelid::regclass from pg_index where indexrelid::regclass =\u0026#39;idx_C\u0026#39;::regclass; indcollation | indexrelid --------------+------------ 950 | idx_c db_utf8_c=\u0026gt; select oid,pg_encoding_to_char(collencoding) as encoding,collname,collcollate,collctype from pg_collation where oid=950; oid | encoding | collname | collcollate | collctype -----+----------+----------+-------------+----------- 950 | | C | C | C Also, you cannot change an index\u0026rsquo;s collation via ALTER INDEX — you must drop and recreate it.\nTest: After specifying an index collate, does modifying the column\u0026rsquo;s collate rewrite the index?\ndb_utf8_c=\u0026gt; select pg_relation_filepath(\u0026#39;tlzl\u0026#39;) TableRelid, pg_relation_filepath(\u0026#39;idxcol4\u0026#39;) IndexRelid4,pg_relation_filepath(\u0026#39;idx_c\u0026#39;) IndexRelidC; tablerelid | indexrelid4 | indexrelidc 
------------------+------------------+------------------ base/40996/41020 | base/40996/41023 | base/40996/41024 (1 row) db_utf8_c=\u0026gt; alter table tlzl alter column col3 type varchar(10) collate \u0026#34;en_US.utf8\u0026#34;; ALTER TABLE db_utf8_c=\u0026gt; select pg_relation_filepath(\u0026#39;tlzl\u0026#39;) TableRelid, pg_relation_filepath(\u0026#39;idxcol4\u0026#39;) IndexRelid4,pg_relation_filepath(\u0026#39;idx_c\u0026#39;) IndexRelidC; tablerelid | indexrelid4 | indexrelidc ------------------+------------------+------------------ base/40996/41020 | base/40996/41023 | base/40996/41024 -- idx_c\u0026#39;s relfileid did not change If an index\u0026rsquo;s collate has been explicitly specified, modifying the column\u0026rsquo;s default collate will not rewrite that index.\nClient character set # When the client sets a character set different from the database, character set conversion occurs — though conversion may not always succeed. See the \u0026ldquo;Character Set Conversion\u0026rdquo; section for details.\nThe server-side character set cannot be changed after database creation, but the client character set can be adjusted at any time.\nThere are many ways to set the client character set:\nSet directly on the client: \\encoding UTF8 -- psql only SET CLIENT_ENCODING TO UTF8; -- session-level parameter change SET NAMES UTF8; -- SQL standard Set the PGCLIENTENCODING environment variable Set the client_encoding server configuration parameter Priority: client-side setting \u0026gt; PGCLIENTENCODING environment variable \u0026gt; client_encoding server configuration parameter\nViewing the client character set:\n\\encoding -- psql only SHOW client_encoding; Expression collate # Adding COLLATE to an expression overrides the expression\u0026rsquo;s original collation, effectively specifying a sort collation.\nAdd the COLLATE keyword at the end of the expression:\nexpr COLLATE collation -- For example select * from tab1 order by name COLLATE 
\u0026#34;C\u0026#34;; For details on sorting and collate index selection, see the \u0026ldquo;Sort Result Issues\u0026rdquo; section.\nMORE # Concept Summary # PostgreSQL localization has three important concepts: character set, locale, and collation. It\u0026rsquo;s essential to understand how they relate.\nThe server-side character set is the most consequential setting: it can only be specified at cluster initialization or database creation time, and cannot be modified after the database is created. The character set determines how data is encoded. Collation does not affect encoding, but a collation is only usable with the encodings it is defined for (see the collencoding values above). Locale can likewise be specified at initialization. Within the locale settings, collation can also be set at database creation time or on individual columns, but these are merely defaults. Only the collation an index is built with affects the physical order of stored entries. A query can use an index only when its collation matches the index\u0026rsquo;s collation exactly; a different collation name does not match, even if it resolves to the same underlying locale.\nClient character set and the four parameters (LC_MESSAGES, etc.) 
are relatively simple: they can be modified directly via parameters and are unrelated to data storage.\nSort Result Issues # Since UTF8 is the most common character set, we\u0026rsquo;ll test sorting with UTF8-related collations:\ncreate database db_UTF8 ENCODING \u0026#39;UTF8\u0026#39; template \u0026#39;template0\u0026#39;; -- Create a UTF8 database; collation doesn\u0026#39;t matter \\c db_utf8 create table tzlz(name varchar(10)); insert into tzlz values(\u0026#39;a\u0026#39;),(\u0026#39;aa\u0026#39;),(\u0026#39;A\u0026#39;),(\u0026#39;AA\u0026#39;),(\u0026#39;啊\u0026#39;),(\u0026#39;阿\u0026#39;),(\u0026#39;〇\u0026#39;); ORDER BY results with different collations:\nselect name from tzlz where name in (\u0026#39;a\u0026#39;,\u0026#39;aa\u0026#39;,\u0026#39;A\u0026#39;,\u0026#39;AA\u0026#39;,\u0026#39;啊\u0026#39;,\u0026#39;阿\u0026#39;,\u0026#39;〇\u0026#39;) order by name; select name from tzlz where name in (\u0026#39;a\u0026#39;,\u0026#39;aa\u0026#39;,\u0026#39;A\u0026#39;,\u0026#39;AA\u0026#39;,\u0026#39;啊\u0026#39;,\u0026#39;阿\u0026#39;,\u0026#39;〇\u0026#39;) order by name collate \u0026#34;C\u0026#34;; select name from tzlz where name in (\u0026#39;a\u0026#39;,\u0026#39;aa\u0026#39;,\u0026#39;A\u0026#39;,\u0026#39;AA\u0026#39;,\u0026#39;啊\u0026#39;,\u0026#39;阿\u0026#39;,\u0026#39;〇\u0026#39;) order by name collate \u0026#34;en_US\u0026#34;; select name from tzlz where name in (\u0026#39;a\u0026#39;,\u0026#39;aa\u0026#39;,\u0026#39;A\u0026#39;,\u0026#39;AA\u0026#39;,\u0026#39;啊\u0026#39;,\u0026#39;阿\u0026#39;,\u0026#39;〇\u0026#39;) order by name collate \u0026#34;en_US.utf8\u0026#34;; select name from tzlz where name in (\u0026#39;a\u0026#39;,\u0026#39;aa\u0026#39;,\u0026#39;A\u0026#39;,\u0026#39;AA\u0026#39;,\u0026#39;啊\u0026#39;,\u0026#39;阿\u0026#39;,\u0026#39;〇\u0026#39;) order by name collate \u0026#34;zh_CN\u0026#34;; select name from tzlz where name in 
(\u0026#39;a\u0026#39;,\u0026#39;aa\u0026#39;,\u0026#39;A\u0026#39;,\u0026#39;AA\u0026#39;,\u0026#39;啊\u0026#39;,\u0026#39;阿\u0026#39;,\u0026#39;〇\u0026#39;) order by name collate \u0026#34;zh_CN.utf8\u0026#34;; Order default C en_US en_US.utf8 zh_CN zh_CN.utf8 1 〇 A 〇 〇 a a 2 a AA a a A A 3 A a A A aa aa 4 aa aa aa aa AA AA 5 AA 〇 AA AA 阿 阿 6 啊 啊 啊 啊 啊 啊 7 阿 阿 阿 阿 〇 〇 Here, default is en_US.utf8 (column collation(default) → database collation(en_US.utf8))\n\u0026#x1f31f; C, en_US.utf8, and zh_CN.utf8 all produce different sort results!\nCollate and index scan test:\ninsert into tzlz values(generate_series(1,10000)); create index idxzlz_default on tzlz(name); create index idxzlz_C on tzlz(name collate \u0026#34;C\u0026#34;); create index idxzlz_enUS_utf8 on tzlz(name collate \u0026#34;en_US.utf8\u0026#34;); Using collate for index optimization:\n-- Without any collate keyword, a simple index scan; no extra sorting db_utf8_c=\u0026gt; explain select name from tzlz where name in (\u0026#39;a\u0026#39;,\u0026#39;aa\u0026#39;,\u0026#39;A\u0026#39;,\u0026#39;AA\u0026#39;,\u0026#39;啊\u0026#39;,\u0026#39;阿\u0026#39;,\u0026#39;〇\u0026#39;) order by name; QUERY PLAN --------------------------------------------------------------------------------- Index Only Scan using idxzlz_default on tzlz (cost=0.29..30.13 rows=8 width=4) Index Cond: (name = ANY (\u0026#39;{a,aa,A,AA,啊,阿,〇}\u0026#39;::text[])) -- Adding collate conversion to the predicate hits the correct index db_utf8=\u0026gt; explain select name from tzlz where name collate \u0026#34;C\u0026#34; in (\u0026#39;a\u0026#39;,\u0026#39;aa\u0026#39;,\u0026#39;A\u0026#39;,\u0026#39;AA\u0026#39;,\u0026#39;啊\u0026#39;,\u0026#39;阿\u0026#39;,\u0026#39;〇\u0026#39;); QUERY PLAN --------------------------------------------------------------------------- Index Only Scan using idxzlz_c on tzlz (cost=0.29..30.12 rows=7 width=4) Index Cond: (name = ANY (\u0026#39;{a,aa,A,AA,啊,阿,〇}\u0026#39;::text[])) db_utf8=\u0026gt; explain select 
name from tzlz where name collate \u0026#34;en_US.utf8\u0026#34; in (\u0026#39;a\u0026#39;,\u0026#39;aa\u0026#39;,\u0026#39;A\u0026#39;,\u0026#39;AA\u0026#39;,\u0026#39;啊\u0026#39;,\u0026#39;阿\u0026#39;,\u0026#39;〇\u0026#39;); QUERY PLAN ----------------------------------------------------------------------------------- Index Only Scan using idxzlz_enus_utf8 on tzlz (cost=0.29..30.12 rows=7 width=4) Index Cond: (name = ANY (\u0026#39;{a,aa,A,AA,啊,阿,〇}\u0026#39;::text[])) -- However, the collation name must match exactly db_utf8=\u0026gt; explain select name from tzlz where name collate \u0026#34;en_US\u0026#34; in (\u0026#39;a\u0026#39;,\u0026#39;aa\u0026#39;,\u0026#39;A\u0026#39;,\u0026#39;AA\u0026#39;,\u0026#39;啊\u0026#39;,\u0026#39;阿\u0026#39;,\u0026#39;〇\u0026#39;); QUERY PLAN ----------------------------------------------------------------- Seq Scan on tzlz (cost=0.00..232.63 rows=7 width=4) Filter: ((name)::text = ANY (\u0026#39;{a,aa,A,AA,啊,阿,〇}\u0026#39;::text[])) -- ORDER BY also needs the collate conversion expression -- Here, the correct index is used, but ORDER BY treats them as different collations (even though they are the same) db_utf8=\u0026gt; explain select name from tzlz where name collate \u0026#34;en_US.utf8\u0026#34; in (\u0026#39;a\u0026#39;,\u0026#39;aa\u0026#39;,\u0026#39;A\u0026#39;,\u0026#39;AA\u0026#39;,\u0026#39;啊\u0026#39;,\u0026#39;阿\u0026#39;,\u0026#39;〇\u0026#39;) order by name; QUERY PLAN ----------------------------------------------------------------------------------------- Sort (cost=30.22..30.23 rows=7 width=4) Sort Key: name -\u0026gt; Index Only Scan using idxzlz_enus_utf8 on tzlz (cost=0.29..30.12 rows=7 width=4) Index Cond: (name = ANY (\u0026#39;{a,aa,A,AA,啊,阿,〇}\u0026#39;::text[])) -- Adding collate conversion to both WHERE and ORDER BY selects the right index and avoids extra sorting db_utf8=\u0026gt; explain select name from tzlz where name collate \u0026#34;en_US.utf8\u0026#34; in 
(\u0026#39;a\u0026#39;,\u0026#39;aa\u0026#39;,\u0026#39;A\u0026#39;,\u0026#39;AA\u0026#39;,\u0026#39;啊\u0026#39;,\u0026#39;阿\u0026#39;,\u0026#39;〇\u0026#39;) order by name collate \u0026#34;en_US.utf8\u0026#34;; QUERY PLAN ------------------------------------------------------------------------------------ Index Only Scan using idxzlz_enus_utf8 on tzlz (cost=0.29..30.12 rows=7 width=42) Index Cond: (name = ANY (\u0026#39;{a,aa,A,AA,啊,阿,〇}\u0026#39;::text[])) After specifying a collation on an index, the SQL must explicitly use the COLLATE keyword to convert the expression. Even if the default is the same as the current collation, PostgreSQL will not use the index.\nLIKE not using index # The drawback of using locales other than C or POSIX in PostgreSQL is its performance impact. It slows character handling and prevents ordinary indexes from being used by LIKE\nPostgreSQL\u0026rsquo;s own words: using non-C or non-POSIX prevents ordinary indexes from being used!\ndb_utf8=\u0026gt; explain select name from tzlz where name like \u0026#39;a%\u0026#39;; QUERY PLAN -------------------------------------------------------------------------- Index Only Scan using idxzlz_c on tzlz (cost=0.29..4.31 rows=1 width=4) Index Cond: ((name \u0026gt;= \u0026#39;a\u0026#39;::text) AND (name \u0026lt; \u0026#39;b\u0026#39;::text)) Filter: ((name)::text ~~ \u0026#39;a%\u0026#39;::text) (3 rows) db_utf8=\u0026gt; explain select name from tzlz where name collate \u0026#34;en_US.utf8\u0026#34; like \u0026#39;a%\u0026#39;; QUERY PLAN -------------------------------------------------------------------------- Index Only Scan using idxzlz_c on tzlz (cost=0.29..4.31 rows=1 width=4) Index Cond: ((name \u0026gt;= \u0026#39;a\u0026#39;::text) AND (name \u0026lt; \u0026#39;b\u0026#39;::text)) Filter: ((name)::text ~~ \u0026#39;a%\u0026#39;::text) PostgreSQL converts LIKE to \u0026gt;= and \u0026lt; during index scans, where \u0026lt; adds a \u0026ldquo;one step greater\u0026rdquo; value. 
This is where the problem lies: collation is strongly tied to sorting order. In ASCII, a+1 is b, but what about Chinese characters?\ndb_utf8=\u0026gt; explain select name from tzlz where name collate \u0026#34;en_US.utf8\u0026#34; like \u0026#39;阿%\u0026#39;; QUERY PLAN -------------------------------------------------------------------------- Index Only Scan using idxzlz_c on tzlz (cost=0.29..6.49 rows=1 width=4) Index Cond: ((name \u0026gt;= \u0026#39;阿\u0026#39;::text) AND (name \u0026lt; \u0026#39;陿\u0026#39;::text)) Filter: ((name)::text ~~ \u0026#39;阿%\u0026#39;::text) Sure enough, another Chinese character appears!\nIf it\u0026rsquo;s a sequential scan, the \u0026gt;= and \u0026lt; won\u0026rsquo;t appear:\ndb_utf8=\u0026gt; drop index idxzlz_c; DROP INDEX db_utf8=\u0026gt; explain select name from tzlz where name collate \u0026#34;en_US.utf8\u0026#34; like \u0026#39;阿%\u0026#39;; QUERY PLAN ------------------------------------------------------ Seq Scan on tzlz (cost=0.00..170.09 rows=1 width=4) Filter: ((name)::text ~~ \u0026#39;阿%\u0026#39;::text) You can create an index that is (claimed by the PostgreSQL docs to be) unrelated to collation rules:\nCREATE INDEX idx_pattern ON tzlz (name varchar_pattern_ops); Let\u0026rsquo;s look at its execution plan:\ndb_utf8=\u0026gt; explain select name from tzlz where name like \u0026#39;阿%\u0026#39;; QUERY PLAN ----------------------------------------------------------------------------- Index Only Scan using idx_pattern on tzlz (cost=0.29..6.49 rows=1 width=4) Index Cond: ((name ~\u0026gt;=~ \u0026#39;阿\u0026#39;::text) AND (name ~\u0026lt;~ \u0026#39;陿\u0026#39;::text)) Filter: ((name)::text ~~ \u0026#39;阿%\u0026#39;::text) It still auto-generates the \u0026ldquo;one greater\u0026rdquo; string — this is definitely related to collation. 
It appears to be using C.\nSo we can conclude:\nWhen PostgreSQL uses a regular index for LIKE, it needs to convert it to \u0026gt;= and \u0026lt;, which requires a \u0026ldquo;one greater\u0026rdquo; value relative to the current string. Since collation is strongly tied to ordering, only an index using the same collation can guarantee data correctness. PostgreSQL chooses the non-localized C collation for this.\nThe quickest workaround is to create a C collation index or a pattern index:\ncreate index idxzlz_C on tzlz(name collate \u0026#34;C\u0026#34;); CREATE INDEX idx_pattern ON tzlz (name varchar_pattern_ops); For other adjustments to default collation at various levels, refer to the sections above.\nDevelopers typically don\u0026rsquo;t specify collation when creating indexes. If it\u0026rsquo;s not C or pattern, LIKE won\u0026rsquo;t use the index. Combined with the common choice of the international character set UTF8, this leaves very few localization options in database operations. 
The recommended setup: character set UTF8, collation C.\nReferences # https://dbafix.com/what-is-the-impact-of-lc_ctype-on-a-postgresql-database/#:~:text=Having%20LC_CTYPE%20set%20to%20%E2%80%98C%E2%80%99%20implies%20that%20C,Postgres%20on%20top%20of%20these%20libc%20functions%2C%20they%E2%80%99re https://www.postgresql.org/docs/current/charset.html https://www.bookstack.cn/read/rds-best-pratice/bfc0037fe00d87dc.md https://help.aliyun.com/zh/rds/apsaradb-rds-for-postgresql/configure-the-collation-of-a-database-on-an-apsaradb-rds-for-postgresql-instance https://baike.baidu.com/item/%E7%BB%9F%E4%B8%80%E7%A0%81/2985798?fromModule=lemma_inlink\u0026fromtitle=Unicode\u0026fromid=750500 https://baike.baidu.com/item/%E4%B8%AD%E6%97%A5%E9%9F%A9%E8%B6%8A%E7%BB%9F%E4%B8%80%E8%A1%A8%E6%84%8F%E6%96%87%E5%AD%97/1301611?fromModule=lemma_inlink\nhttps://blog.csdn.net/songyundong1993/article/details/128739919\nOriginal article (Chinese): PostgreSQL本地化\n","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/2024/08/12/postgresql-localization/","section":"Posts","summary":"Localization Concepts # The purpose of localization is to support the language features and rules of different countries and regions. With localization support, you can use character sets that handle Chinese, French, Japanese, and more. Beyond character sets, there are also character sorting rules and other language-related rule support. For example, we know how to sort (‘a’, ‘b’), but how should (‘a’, ‘A’) and (‘啊’, ‘阿’) be sorted?\n","title":"PostgreSQL Localization","type":"posts"},{"content":" What is a Partitioned Table # Database partitioning splits table data into smaller physical shards to improve performance, availability, and manageability. Partitioned tables are a common optimization technique for large tables in relational databases. 
DBMS generally provide partition management, and applications can access partitioned tables directly without changing their architecture—though good performance requires proper partition access patterns.\nPartitioned tables are common database technology, but PostgreSQL partitioned tables have many unique characteristics: multiple implementation approaches, partitions being regular tables, partition maintenance strategies, SQL optimization considerations, and some known issues.\nPartition Table Implementations # PostgreSQL provides various partition implementation approaches. The officially supported methods are declarative partitioning and inheritance partitioning, while third-party plugins include pg_pathman, pg_partman, etc. Since the introduction of official declarative partitioning, only one approach is generally recommended: declarative partitioning. Covering every implementation\u0026rsquo;s features, details, and history would make this article excessively long and is less relevant going forward. This article focuses mainly on declarative partitioning, with brief introductions to other approaches. However, due to existing deployments and feature differences, understanding declarative partitioning, inheritance partitioning, and pg_pathman remains valuable.\nDeclarative Partitioning # Declarative partitioning, also called native partitioning, has been supported since PG10. It is the \u0026ldquo;officially supported\u0026rdquo; partitioning approach and the most recommended method. Although different from inheritance partitioning, declarative partitioning is also implemented internally using table inheritance. 
It supports three partition methods: RANGE, LIST, and HASH (HASH was added in PG11).\nRANGE Partitioning # RANGE partitioned tables split data by range, with partition boundaries defined as [t1, t2) (inclusive lower bound, exclusive upper bound).\nCREATE TABLE PUBLIC.LZLPARTITION1 ( id int, name varchar(50) NULL, DATE_CREATED timestamp NOT NULL DEFAULT now() ) PARTITION BY RANGE(DATE_CREATED); alter table public.lzlpartition1 add primary key(id,DATE_CREATED); create table LZLPARTITION1_202301 partition of LZLPARTITION1 for values from (\u0026#39;2023-01-01 00:00:00\u0026#39;) to (\u0026#39;2023-02-01 00:00:00\u0026#39;); create table LZLPARTITION1_202302 partition of LZLPARTITION1 for values from (\u0026#39;2023-02-01 00:00:00\u0026#39;) to (\u0026#39;2023-03-01 00:00:00\u0026#39;); -- Insert some data into the partitioned table =\u0026gt; INSERT INTO lzlpartition1 SELECT random() * 10000, md5(g::text),g FROM generate_series(\u0026#39;2023-01-01\u0026#39;::date, \u0026#39;2023-02-28\u0026#39;::date, \u0026#39;1 minute\u0026#39;) as g; INSERT 0 83521 Because the upper bound is exclusive, a row with DATE_CREATED = \u0026#39;2023-02-01 00:00:00\u0026#39; belongs to LZLPARTITION1_202302, not LZLPARTITION1_202301.\nInspecting the partitioned table shows that each partition is also an independent table:\nlzldb=\u0026gt; \\d+ lzlpartition1 Partitioned table \u0026#34;public.lzlpartition1\u0026#34; Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description --------------+-----------------------------+-----------+----------+---------+----------+-------------+--------------+------------- id | integer | | not null | | plain | | | name | character varying(50) | | | | extended | | | date_created | timestamp without time zone | | not null | now() | plain | | | Partition key: RANGE (date_created) Indexes: \u0026#34;lzlpartition1_pkey\u0026#34; PRIMARY KEY, btree (id, date_created) Partitions: lzlpartition1_202301 FOR VALUES FROM (\u0026#39;2023-01-01 00:00:00\u0026#39;) TO 
(\u0026#39;2023-02-01 00:00:00\u0026#39;), lzlpartition1_202302 FOR VALUES FROM (\u0026#39;2023-02-01 00:00:00\u0026#39;) TO (\u0026#39;2023-03-01 00:00:00\u0026#39;) lzldb=\u0026gt; \\d+ lzlpartition1_202301 Table \u0026#34;public.lzlpartition1_202301\u0026#34; Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description --------------+-----------------------------+-----------+----------+---------+----------+-------------+--------------+------------- id | integer | | not null | | plain | | | name | character varying(50) | | | | extended | | | date_created | timestamp without time zone | | not null | now() | plain | | | Partition of: lzlpartition1 FOR VALUES FROM (\u0026#39;2023-01-01 00:00:00\u0026#39;) TO (\u0026#39;2023-02-01 00:00:00\u0026#39;) Partition constraint: ((date_created IS NOT NULL) AND (date_created \u0026gt;= \u0026#39;2023-01-01 00:00:00\u0026#39;::timestamp without time zone) AND (date_created \u0026lt; \u0026#39;2023-02-01 00:00:00\u0026#39;::timestamp without time zone)) Indexes: \u0026#34;lzlpartition1_202301_pkey\u0026#34; PRIMARY KEY, btree (id, date_created) Access method: heap Primary keys, indexes, and NOT NULL/CHECK constraints are automatically created on partitions. Since partitions are independent tables, constraints and indexes can also be created on individual partitions. 
(ATTACH does not automatically create these; see the ATTACH section for details.)\nLIST Partitioning # LIST partitioning routes each row to the partition whose value list contains the row\u0026rsquo;s partition key value.\nCREATE TABLE cities ( city_id bigserial not null, name text, population bigint ) PARTITION BY LIST (left(lower(name), 1)); CREATE TABLE cities_ab PARTITION OF cities FOR VALUES IN (\u0026#39;a\u0026#39;, \u0026#39;b\u0026#39;); CREATE TABLE cities_null PARTITION OF cities ( CONSTRAINT city_id_nonzero CHECK (city_id != 0) ) FOR VALUES IN (null); insert into cities(name,population) values(\u0026#39;Acity\u0026#39;,10); insert into cities(name,population) values(null,20); =\u0026gt; SELECT tableoid::regclass,* FROM cities; tableoid | city_id | name | population -------------+---------+--------+------------ cities_ab | 1 | Acity | 10 cities_null | 2 | [null] | 20 LIST partitioned tables support creating a NULL partition.\nHASH Partitioning # HASH partitioning distributes data across partitions to spread out hot data evenly.\nCREATE TABLE orders (order_id int,name varchar(10)) PARTITION BY HASH (order_id); CREATE TABLE orders_p1 PARTITION OF orders FOR VALUES WITH (MODULUS 3, REMAINDER 0); CREATE TABLE orders_p2 PARTITION OF orders FOR VALUES WITH (MODULUS 3, REMAINDER 1); CREATE TABLE orders_p3 PARTITION OF orders FOR VALUES WITH (MODULUS 3, REMAINDER 2); A hash-partitioned table cannot have a default partition, and each partition\u0026rsquo;s REMAINDER must be less than its MODULUS, so with a single modulus of 3 there can be at most three partitions.\n=\u0026gt; CREATE TABLE orders_p2 PARTITION OF orders -\u0026gt; FOR VALUES WITH (MODULUS 3, REMAINDER 4); ERROR: 42P16: remainder for hash partition must be less than modulus LOCATION: transformPartitionBound, parse_utilcmd.c:3939 =\u0026gt; CREATE TABLE orders_p4 PARTITION OF orders default; ERROR: 42P16: a hash-partitioned table may not have a default partition LOCATION: transformPartitionBound, parse_utilcmd.c:3909 Insert data:\n=\u0026gt;insert into orders 
values(generate_series(1,10000),\u0026#39;a\u0026#39;); INSERT 0 10000 =\u0026gt;SELECT tableoid::regclass,count(*) FROM orders group by tableoid::regclass; tableoid | count -----------+------- orders_p1 | 3277 orders_p3 | 3354 orders_p2 | 3369 =\u0026gt;select tableoid::regclass,* from orders limit 30; tableoid | order_id | name -----------+----------+------ orders_p1 | 2 | a orders_p1 | 4 | a orders_p1 | 6 | a orders_p1 | 8 | a orders_p1 | 15 | a orders_p1 | 16 | a orders_p1 | 18 | a orders_p1 | 19 | a orders_p1 | 20 | a HASH partition data is distributed evenly across partitions:\n-- Insert 100 NULL rows =\u0026gt; insert into orders values(null,generate_series(1,100)::text); INSERT 0 100 =\u0026gt; SELECT tableoid::regclass,count(*) FROM orders where order_id is null group by tableoid::regclass; tableoid | count -----------+------- orders_p1 | 100 -- All NULL data ends up on the remainder 0 partition =\u0026gt;\\d+ orders_p1 Table \u0026#34;public.orders_p1\u0026#34; Column | Type | Collation | Nullable | Default | Storage | Stats target | Description ----------+-----------------------+-----------+----------+---------+----------+--------------+------------- order_id | integer | | | | plain | | name | character varying(10) | | | | extended | | Partition of: orders FOR VALUES WITH (modulus 3, remainder 0) Partition constraint: satisfies_hash_partition(\u0026#39;412053\u0026#39;::oid, 3, 0, order_id) Although HASH partitioned tables have no concept of a NULL partition, they can store NULL data. NULL values are placed on the remainder 0 partition.\nMulti-level (Mixed) Partitioning # Partitions can themselves be further partitioned, forming a cascading structure. Sub-partitions can use different partition methods — this is called mixed partitioning. 
Creating a mixed partition:\ncreate table part_1000(id bigserial not null,name varchar(10),createddate timestamp) partition by range(createddate); create table part_2001 partition of part_1000 for values from (\u0026#39;2023-01-01 00:00:00\u0026#39;) to (\u0026#39;2023-02-01 00:00:00\u0026#39;) partition by list(name) ; create table part_2002 partition of part_1000 for values from (\u0026#39;2023-02-01 00:00:00\u0026#39;) to (\u0026#39;2023-03-01 00:00:00\u0026#39;) partition by list(name) ; create table part_2003 partition of part_1000 for values from (\u0026#39;2023-03-01 00:00:00\u0026#39;) to (\u0026#39;2023-04-01 00:00:00\u0026#39;) partition by list(name) ; create table part_3001 partition of part_2001 FOR VALUES IN (\u0026#39;abc\u0026#39;); create table part_3002 partition of part_2001 FOR VALUES IN (\u0026#39;def\u0026#39;); create table part_3003 partition of part_2001 FOR VALUES IN (\u0026#39;jkl\u0026#39;); \\d+ only shows the immediate next-level partitions:\n\\d+ part_1000 Partitioned table \u0026#34;dbmgr.part_1000\u0026#34; Column | Type | Collation | Nullable | Default | Storage | Stats target | Description -------------+-----------------------------+-----------+----------+---------------------------------------+----------+--------------+------------- id | bigint | | not null | nextval(\u0026#39;part_1000_id_seq\u0026#39;::regclass) | plain | | name | character varying(10) | | | | extended | | createddate | timestamp without time zone | | | | plain | | Partition key: RANGE (createddate) Partitions: part_2001 FOR VALUES FROM (\u0026#39;2023-01-01 00:00:00\u0026#39;) TO (\u0026#39;2023-02-01 00:00:00\u0026#39;), PARTITIONED, part_2002 FOR VALUES FROM (\u0026#39;2023-02-01 00:00:00\u0026#39;) TO (\u0026#39;2023-03-01 00:00:00\u0026#39;), PARTITIONED, part_2003 FOR VALUES FROM (\u0026#39;2023-03-01 00:00:00\u0026#39;) TO (\u0026#39;2023-04-01 00:00:00\u0026#39;), PARTITIONED =\u0026gt; \\d+ part_2001 Partitioned table 
\u0026#34;dbmgr.part_2001\u0026#34; Column | Type | Collation | Nullable | Default | Storage | Stats target | Description -------------+-----------------------------+-----------+----------+---------------------------------------+----------+--------------+------------- id | bigint | | not null | nextval(\u0026#39;part_1000_id_seq\u0026#39;::regclass) | plain | | name | character varying(10) | | | | extended | | createddate | timestamp without time zone | | | | plain | | Partition of: part_1000 FOR VALUES FROM (\u0026#39;2023-01-01 00:00:00\u0026#39;) TO (\u0026#39;2023-02-01 00:00:00\u0026#39;) Partition constraint: ((createddate IS NOT NULL) AND (createddate \u0026gt;= \u0026#39;2023-01-01 00:00:00\u0026#39;::timestamp without time zone) AND (createddate \u0026lt; \u0026#39;2023-02-01 00:00:00\u0026#39;::timestamp without time zone)) Partition key: LIST (name) Partitions: part_3001 FOR VALUES IN (\u0026#39;abc\u0026#39;), part_3002 FOR VALUES IN (\u0026#39;def\u0026#39;), part_3003 FOR VALUES IN (\u0026#39;jkl\u0026#39;) Now insert a row:\n=\u0026gt; insert into part_1000 values(random() * 10000,\u0026#39;abc\u0026#39;,\u0026#39;2023-01-01 08:00:00\u0026#39;); INSERT 0 1 =\u0026gt; SELECT tableoid::regclass,* FROM part_1000; tableoid | id | name | createddate -----------+------+------+--------------------- part_3001 | 6385 | abc | 2023-01-01 08:00:00 Data is stored in the lowest-level sub-partition.\nDeclarative Partitioning Feature Summary # No INTERVAL partitioning. There is no built-in automatic partition creation feature, which makes maintenance more cumbersome. Partitions themselves are tables. This is a distinctive characteristic. This not only allows PostgreSQL to flexibly operate on sub-partitions but, more importantly, affects functionality and behavior. TRUNCATE, VACUUM, and ANALYZE on a partitioned table operate on all partitions. 
TRUNCATE ONLY cannot be executed on the parent table but can be executed on a child table containing data, clearing only that sub-partition. RANGE and HASH partition keys can have multiple columns; LIST partition keys can only be a single column or expression. The partitioned parent table itself is empty; only the lowest-level sub-partitions contain data. A DEFAULT partition receives data that falls outside declared ranges. Without a DEFAULT partition, inserting out-of-range data will raise an error. When adding a new partition, check whether the DEFAULT partition contains data belonging to the new partition. Partitions created via PARTITION OF automatically create indexes, constraints, and row-level triggers from the parent table. ATTACH does not handle any indexes, constraints, or other objects. Inheritance Partitioning # Inheritance partitioning is also officially supported. It leverages PostgreSQL\u0026rsquo;s table inheritance feature to implement partitioning functionality. Inheritance partitioning is more flexible than declarative partitioning. Implementing inheritance partitioning requires two PostgreSQL features: table inheritance and write redirection. Write redirection can be implemented via rules or triggers.\nCreating Inheritance Partition Tables # Example of creating inheritance partitioned tables: 1. Create the parent table\nCREATE TABLE measurement ( city_id int not null, logdate date not null, peaktemp int, unitsales int ); 2. Create child tables with CHECK constraints for partitioning ranges\nCREATE TABLE measurement_202308 ( CHECK ( logdate \u0026gt;= DATE \u0026#39;2023-08-01\u0026#39; AND logdate \u0026lt; DATE \u0026#39;2023-09-01\u0026#39; ) ) INHERITS (measurement); CREATE TABLE measurement_202309 ( CHECK ( logdate \u0026gt;= DATE \u0026#39;2023-09-01\u0026#39; AND logdate \u0026lt; DATE \u0026#39;2023-10-01\u0026#39; ) ) INHERITS (measurement); 3. 
Create rules or triggers to redirect inserted data to the corresponding child tables\nCREATE OR REPLACE FUNCTION measurement_insert_trigger() RETURNS TRIGGER AS $$ BEGIN IF ( NEW.logdate \u0026gt;= DATE \u0026#39;2023-08-01\u0026#39; AND NEW.logdate \u0026lt; DATE \u0026#39;2023-09-01\u0026#39; ) THEN INSERT INTO measurement_202308 VALUES (NEW.*); ELSIF ( NEW.logdate \u0026gt;= DATE \u0026#39;2023-09-01\u0026#39; AND NEW.logdate \u0026lt; DATE \u0026#39;2023-10-01\u0026#39; ) THEN INSERT INTO measurement_202309 VALUES (NEW.*); ELSE RAISE EXCEPTION \u0026#39;Date out of range. Fix the measurement_insert_trigger() function!\u0026#39;; END IF; RETURN NULL; END; $$ LANGUAGE plpgsql; CREATE TRIGGER insert_measurement_trigger BEFORE INSERT ON measurement FOR EACH ROW EXECUTE FUNCTION measurement_insert_trigger(); A basic inheritance partitioned table is now set up.\n=\u0026gt;\\d+ measurement Table \u0026#34;public.measurement\u0026#34; Column | Type | Collation | Nullable | Default | Storage | Stats target | Description -----------+---------+-----------+----------+---------+---------+--------------+------------- city_id | integer | | not null | | plain | | logdate | date | | not null | | plain | | peaktemp | integer | | | | plain | | unitsales | integer | | | | plain | | Triggers: insert_measurement_trigger BEFORE INSERT ON measurement FOR EACH ROW EXECUTE FUNCTION measurement_insert_trigger() Child tables: measurement_202308, measurement_202309 Access method: heap Test insertion and querying:\n-- Inserting data outside the defined range raises an error =\u0026gt; insert into measurement values(1001, now() - interval \u0026#39;31\u0026#39; day ,1,1); ERROR: P0001: Date out of range. Fix the measurement_insert_trigger() function! 
CONTEXT: PL/pgSQL function measurement_insert_trigger() line 10 at RAISE LOCATION: exec_stmt_raise, pl_exec.c:3889 -- Inserting data is redirected to the child table =\u0026gt; insert into measurement values(1001,now(),1,1); INSERT 0 0 -- Querying the parent table returns data from child tables =\u0026gt; select tableoid::regclass,* from measurement; tableoid | city_id | logdate | peaktemp | unitsales --------------------+---------+------------+----------+----------- measurement_202308 | 1001 | 2023-08-03 | 1 | 1 RULE vs. Trigger Besides triggers, PostgreSQL can also use rules to redirect inserts. Example rule statements:\nCREATE RULE measurement_insert_202308 AS ON INSERT TO measurement WHERE ( logdate \u0026gt;= DATE \u0026#39;2023-08-01\u0026#39; AND logdate \u0026lt; DATE \u0026#39;2023-09-01\u0026#39; ) DO INSTEAD INSERT INTO measurement_202308 VALUES (NEW.*); CREATE RULE measurement_insert_202309 AS ON INSERT TO measurement WHERE ( logdate \u0026gt;= DATE \u0026#39;2023-09-01\u0026#39; AND logdate \u0026lt; DATE \u0026#39;2023-10-01\u0026#39; ) DO INSTEAD INSERT INTO measurement_202309 VALUES (NEW.*); Differences between rules and triggers:\nRules carry more overhead than triggers, but that overhead is paid once per statement rather than once per row, so rules can perform better for bulk inserts. In all other cases, triggers are preferable. COPY does not fire rules but does fire triggers. When using rules, data can be COPY\u0026rsquo;d directly into child tables. When inserting data outside the defined ranges, rules silently insert the row into the parent table, while the trigger shown above raises an error. Indexes To improve performance, you also need to create indexes and enable constraint_exclusion. Indexes on partitions are generally essential. For inheritance tables, indexes must be manually created on child tables. 
Example of creating indexes:\nCREATE INDEX idx_measurement_202308_logdate ON measurement_202308 (logdate); CREATE INDEX idx_measurement_202309_logdate ON measurement_202309 (logdate); Insert some data and check the execution plan:\n-- \u0026#39;2023-08-04\u0026#39; has only 1 row, allowing it to use the index =\u0026gt; insert into measurement values(1001,now()+interval \u0026#39;1\u0026#39; day,1,1); INSERT 0 0 =\u0026gt; insert into measurement values(generate_series(1,1000),now(),1,1); INSERT 0 0 =\u0026gt; explain select * from measurement where logdate=\u0026#39;2023-08-04\u0026#39;; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------ Append (cost=0.00..5.17 rows=2 width=16) -\u0026gt; Seq Scan on measurement measurement_1 (cost=0.00..0.00 rows=1 width=16) Filter: (logdate = \u0026#39;2023-08-04\u0026#39;::date) -\u0026gt; Index Scan using idx_measurement_202308_logdate on measurement_202308 measurement_2 (cost=0.14..5.16 rows=1 width=16) Index Cond: (logdate = \u0026#39;2023-08-04\u0026#39;::date) In the execution plan above, the August partition is accessed through its own index. Because constraint_exclusion applies to inheritance tables by default, the September partition was excluded and only August was scanned. The parent table has no CHECK constraint that could exclude it, so it always appears in the execution plan; since the parent table is generally empty, this has minimal impact.\nconstraint_exclusion # constraint_exclusion controls whether the optimizer uses constraints to reduce unnecessary table access. This parameter is commonly used to optimize inheritance partitioning: by reducing child table access, it improves SQL performance. (This functionality is similar to the enable_partition_pruning parameter, which controls partition pruning for declarative partitioned tables.) 
constraint_exclusion has three values: on: All tables are checked for constraints. partition: Inheritance tables and UNION ALL subqueries are checked for constraints (default). off: Constraints are not checked. Constraint exclusion only occurs during execution plan generation, not during actual execution (partition pruning can occur during execution). This means constraint exclusion does not happen when using bound parameters or variable values. For example, when using functions like now() whose specific value the optimizer cannot determine, the optimizer cannot exclude partitions that don\u0026rsquo;t need to be accessed at all:\n=\u0026gt; select now(); now ------------------------------- 2023-08-03 17:12:04.772658+08 -- The optimizer did not exclude the September partition =\u0026gt; explain select * from measurement where logdate\u0026lt;=now(); QUERY PLAN ----------------------------------------------------------------------------------------------------- Append (cost=0.00..55.98 rows=1628 width=16) -\u0026gt; Seq Scan on measurement measurement_1 (cost=0.00..0.00 rows=1 width=16) Filter: (logdate \u0026lt;= now()) -\u0026gt; Seq Scan on measurement_202308 measurement_2 (cost=0.00..21.15 rows=1010 width=16) Filter: (logdate \u0026lt;= now()) -\u0026gt; Bitmap Heap Scan on measurement_202309 measurement_3 (cost=7.44..26.69 rows=617 width=16) Recheck Cond: (logdate \u0026lt;= now()) -\u0026gt; Bitmap Index Scan on idx_measurement_202309_logdate (cost=0.00..7.28 rows=617 width=0) Index Cond: (logdate \u0026lt;= now()) Additionally, constraint exclusion itself needs to check all child table constraints. If there are too many child table constraints, the efficiency of generating execution plans will be affected. 
Therefore, inheritance partitioning is not recommended for creating too many child partitions.\nAdding/Removing Partitions in Inheritance Partitioning # To turn an inherited partition into a regular table:\nALTER TABLE measurement_202308 NO INHERIT measurement; To add an existing regular table (with data) as a child table in the inheritance partition:\nCREATE TABLE measurement_202310 (LIKE measurement INCLUDING DEFAULTS INCLUDING CONSTRAINTS); ALTER TABLE measurement_202310 ADD CONSTRAINT measurement_202310_logdate_check CHECK ( logdate \u0026gt;= DATE \u0026#39;2023-10-01\u0026#39; AND logdate \u0026lt; DATE \u0026#39;2023-11-01\u0026#39; ); --insert into measurement_202310 values(2001,\u0026#39;20231010\u0026#39;,3,3); ALTER TABLE measurement_202310 INHERIT measurement; Inheritance Partitioning Feature Summary # Inheritance partitioning is more flexible than declarative partitioning, but some declarative partitioning features are unavailable. Child tables inherit parent table constraints, so global constraints should not be set on the parent table. Indexes are not inherited; they must be created individually on each child table. Declarative partitioning only supports RANGE, LIST, and HASH partitions. Inheritance partitioning can support more, including custom partitioning methods. Dropping a child table does not invalidate the trigger. PostgreSQL does not have Oracle\u0026rsquo;s concept of invalidated objects (indexes do have an invalidation concept). Generally, using triggers for insert redirection is more efficient than rules. When adding a new partition, if the trigger function lacks a rule for that partition, the trigger function needs to be updated. Inheritance partitioning supports multiple inheritance. Constraint exclusion cannot occur during execution; using fixed values for queries is recommended. With inheritance partitioning, avoid creating too many child partitions. 
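The point above about keeping the trigger function in sync with the set of child tables can be sketched as follows. This is a minimal sketch, assuming the measurement_202310 child table from the previous section has already been added with ALTER TABLE ... INHERIT; it re-creates the measurement_insert_trigger() function defined earlier with one extra branch and changes nothing else.

```sql
-- Sketch: extend the routing function after measurement_202310 joins the hierarchy.
-- Assumes the measurement, measurement_202308/09/10 tables from the examples above.
CREATE OR REPLACE FUNCTION measurement_insert_trigger()
RETURNS TRIGGER AS $$
BEGIN
    IF (NEW.logdate >= DATE '2023-08-01' AND NEW.logdate < DATE '2023-09-01') THEN
        INSERT INTO measurement_202308 VALUES (NEW.*);
    ELSIF (NEW.logdate >= DATE '2023-09-01' AND NEW.logdate < DATE '2023-10-01') THEN
        INSERT INTO measurement_202309 VALUES (NEW.*);
    ELSIF (NEW.logdate >= DATE '2023-10-01' AND NEW.logdate < DATE '2023-11-01') THEN
        -- New branch for the newly inherited October child table
        INSERT INTO measurement_202310 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'Date out of range. Fix the measurement_insert_trigger() function!';
    END IF;
    -- Returning NULL suppresses the insert into the parent table itself
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;
```

Because the existing trigger insert_measurement_trigger refers to the function by name, replacing the function body is enough; the trigger itself does not need to be re-created.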
pg_pathman # pg_pathman is a third-party plugin implementing partitioning functionality. The pg_pathman README on GitHub and articles on using pg_pathman already describe pathman in great detail. Here we only highlight key points and do some simple testing.\npg_pathman Basics # No Longer Maintained\nNOTE: this project is not under development anymore\npg_pathman supports PostgreSQL 9.5 through 15. Later PostgreSQL versions are no longer supported, and existing versions only receive bug fixes — no new features will be added. pg_pathman emerged because older PostgreSQL versions had incomplete partitioning features. Now that native partitioned tables (declarative partitioning) are very mature, pg_pathman also recommends using native partitioned tables. Existing pg_pathman partitioned tables are also recommended to be migrated to native partitioned tables. pg_pathman, once recognized by many users, is now history. Even though it\u0026rsquo;s no longer updated, its feature set is still richer than the current native partitioned tables. Feature Highlights pg_pathman is quite powerful, supporting some features that native partitioned tables do not. However, pathman is not perfect either and has many issues in practice. Key points to note about pg_pathman include:\npg_pathman can manage partitions through partition management functions. It supports replace, merge, split partition operations; attach and detach operations; and INTERVAL partitioning. pg_pathman has many optimizations for partitioned table execution plans. pg_pathman only supports RANGE and HASH partition types. The pathman_config table stores partition configuration information; it provides partition task views. Partition information is cached in memory for execution plan generation. 
Basic pg_pathman Usage # Creating pathman RANGE partitions\n-- The regular table serves as the parent table CREATE TABLE journal ( id SERIAL, dt TIMESTAMP NOT NULL, level INTEGER, msg TEXT); -- Indexes on the parent table are automatically created on child partitions CREATE INDEX ON journal(dt); -- Create partitions select create_range_partitions(\u0026#39;journal\u0026#39;::regclass, \u0026#39;dt\u0026#39;, \u0026#39;2023-01-01 00:00:00\u0026#39;::timestamp, interval \u0026#39;1 month\u0026#39;, 6, false) ; -- View table definition =\u0026gt; \\d+ journal Table \u0026#34;public.journal\u0026#34; Column | Type | Collation | Nullable | Default | Storage | Stats target | Description --------+-----------------------------+-----------+----------+-------------------------------------+----------+--------------+------------- id | integer | | not null | nextval(\u0026#39;journal_id_seq\u0026#39;::regclass) | plain | | dt | timestamp without time zone | | not null | | plain | | level | integer | | | | plain | | msg | text | | | | extended | | Indexes: \u0026#34;journal_dt_idx\u0026#34; btree (dt) Child tables: journal_1, journal_2, journal_3, journal_4, journal_5, journal_6 Access method: heap =\u0026gt; \\d+ journal_6 Table \u0026#34;public.journal_6\u0026#34; Column | Type | Collation | Nullable | Default | Storage | Stats target | Description --------+-----------------------------+-----------+----------+-------------------------------------+----------+--------------+------------- id | integer | | not null | nextval(\u0026#39;journal_id_seq\u0026#39;::regclass) | plain | | dt | timestamp without time zone | | not null | | plain | | level | integer | | | | plain | | msg | text | | | | extended | | Indexes: \u0026#34;journal_6_dt_idx\u0026#34; btree (dt) Check constraints: \u0026#34;pathman_journal_6_check\u0026#34; CHECK (dt \u0026gt;= \u0026#39;2023-06-01 00:00:00\u0026#39;::timestamp without time zone AND dt \u0026lt; \u0026#39;2023-07-01 00:00:00\u0026#39;::timestamp 
without time zone) Inherits: journal Access method: heap -- Insert data INSERT INTO journal (dt, level, msg) SELECT g, random() * 10000, md5(g::text) FROM generate_series(\u0026#39;2023-01-01\u0026#39;::date, \u0026#39;2023-02-28\u0026#39;::date, \u0026#39;1 hour\u0026#39;) as g; -- Insert data for which no corresponding partition has been created yet =\u0026gt; INSERT INTO journal (dt, level, msg) values(\u0026#39;2023-07-01\u0026#39;::date,\u0026#39;11\u0026#39;,\u0026#39;1\u0026#39;); INSERT 0 1 -- Check partition data distribution; the INTERVAL partition has been automatically created =\u0026gt; SELECT tableoid::regclass AS partition, count(*) FROM journal group by partition; partition | count -----------+------- journal_7 | 1 journal_2 | 649 journal_1 | 744 -- View execution plan -- Partition pruning has occurred =\u0026gt; explain select * from journal where dt=\u0026#39;2023-01-01 22:00:00\u0026#39;; QUERY PLAN ----------------------------------------------------------------------------------------------------- Append (cost=0.00..5.30 rows=2 width=48) -\u0026gt; Seq Scan on journal journal_1 (cost=0.00..0.00 rows=1 width=48) Filter: (dt = \u0026#39;2023-01-01 22:00:00\u0026#39;::timestamp without time zone) -\u0026gt; Index Scan using journal_1_dt_idx on journal_1 journal_1_1 (cost=0.28..5.29 rows=1 width=49) Index Cond: (dt = \u0026#39;2023-01-01 22:00:00\u0026#39;::timestamp without time zone) Creating pathman HASH partitions\n-- Create parent table CREATE TABLE items ( id SERIAL PRIMARY KEY, name TEXT, code BIGINT); -- Create HASH partitions select create_hash_partitions(\u0026#39;items\u0026#39;::regclass, \u0026#39;id\u0026#39;, 3, false) ; -- Insert data INSERT INTO items (id, name, code) SELECT g, md5(g::text), random() * 100000 FROM generate_series(1, 1000) as g; =\u0026gt; SELECT tableoid::regclass AS partition, count(*) FROM items group by partition; partition | count -----------+------- items_2 | 344 items_0 | 318 items_1 | 338 =\u0026gt; \\d+ 
items Table \u0026#34;public.items\u0026#34; Column | Type | Collation | Nullable | Default | Storage | Stats target | Description --------+---------+-----------+----------+-----------------------------------+----------+--------------+------------- id | integer | | not null | nextval(\u0026#39;items_id_seq\u0026#39;::regclass) | plain | | name | text | | | | extended | | code | bigint | | | | plain | | Indexes: \u0026#34;items_pkey\u0026#34; PRIMARY KEY, btree (id) Child tables: items_0, items_1, items_2 Access method: heap =\u0026gt; \\d+ items_1 Table \u0026#34;public.items_1\u0026#34; Column | Type | Collation | Nullable | Default | Storage | Stats target | Description --------+---------+-----------+----------+-----------------------------------+----------+--------------+------------- id | integer | | not null | nextval(\u0026#39;items_id_seq\u0026#39;::regclass) | plain | | name | text | | | | extended | | code | bigint | | | | plain | | Indexes: \u0026#34;items_1_pkey\u0026#34; PRIMARY KEY, btree (id) Check constraints: \u0026#34;pathman_items_1_check\u0026#34; CHECK (get_hash_part_idx(hashint4(id), 3) = 1) Inherits: items Access method: heap Pros and Cons of PostgreSQL Partitioned Tables # Advantages of Partitioned Tables # SQL performance improvement. In certain scenarios, such as splitting a large amount of data into multiple partitions where SQL only needs to query one partition, SQL performance can be dramatically improved. Partitions can work together with indexes. For example, accessing an index on a single partition is more efficient than accessing a large unpartitioned index. Dropping a single partition is much more efficient than deleting many rows. 
This is common in time-range partitioning — dropping an unused historical partition is very fast, but without partitioning, DELETE operations are not only slow but also require additional maintenance. VACUUM is faster. Reclaiming old version information or collecting statistics on a large table is very slow. If VACUUM hasn\u0026rsquo;t finished executing, SQL may already be experiencing problems. With partitioning, VACUUM becomes much faster. I/O distribution capability. Different partitions can be placed on different paths or different disks. Rarely-used data can be placed on cheaper disks. More maintenance techniques. Directly maintaining a very large table is difficult — for example, VACUUM on an extremely large table has many issues. With partitioned tables, each partition can run VACUUM independently. Moreover, ATTACH/DETACH, local indexes/constraints, and more can be flexibly used in many scenarios. Disadvantages of Partitioned Tables # In PostgreSQL, every partition of a partitioned table can be treated as a regular table. Too many partitions can lead to longer SQL parsing times and higher memory load, even causing errors. See the previous article: Too many range table entries even with a modest number of partitions Even if having too many partitions doesn\u0026rsquo;t cause errors, and partition pruning doesn\u0026rsquo;t happen during execution plan generation (it might happen during execution), the EXPLAIN output will be extremely long. At that point, the logs will also contain lengthy execution plans, affecting log readability. Some strange issues: Different users see different execution plans Limitations of Partitioned Tables # No native automatic partition creation feature\nOnly local partition indexes are supported; global indexes are not supported\nPrimary keys must include the partition key. PostgreSQL currently can only enforce uniqueness within each partition, hence this limitation. 
Oracle does not have this restriction; MySQL has a similar one, since every unique key on a partitioned MySQL table must include all columns of the partitioning expression.\nUnique indexes must include the partition key. PostgreSQL currently can only enforce uniqueness within each partition. The same applies to primary keys.\nCannot create globally-defined constraints\nBEFORE ROW triggers on INSERT cannot change which partition is the final destination for a new row.\nTemporary table partitions and regular table partitions cannot coexist under the same partitioned table.\nIn declarative partitioning, parent and child table columns must be identical; in inheritance partitioning, child tables can have more columns than the parent table.\nIn declarative partitioning, CHECK and NOT NULL constraints on the parent are always inherited; individual partitions cannot opt out of them.\nRANGE partitions cannot store NULL values. HASH partitions have no concept of NULL partitions but can store NULL values; they are placed on the remainder 0 partition. LIST partitions can explicitly create a NULL partition to store NULL data.\nWhen Should You Use Partitioned Tables? # First, to use partitioned tables you must understand the advantages, disadvantages, and limitations they bring. For example, when data volume is large, partitioning can improve performance; hot/cold data separation also makes partition data management easier. You should decide whether to partition and how to partition based on your specific business situation and hardware resources. However, developers will always ask questions like \u0026ldquo;how much data warrants partitioning.\u0026rdquo; Advice on using partitioned tables can only be given in general terms. If you don\u0026rsquo;t know how to partition, you can refer to the following recommendations (if you already have sufficient understanding of table partitioning, please ignore):\nThe table is large enough, and SQL queries on it always include (or can be rewritten to include) the partition key column. Clear hot/cold data separation. 
For example, new data is always inserted into the current month\u0026rsquo;s partition, while the other 11 months of old partitions are read-only. VACUUM can no longer keep up. Partition Table Permissions # Permission issues are less discussed in the context of partitioned table knowledge, but they are still worth paying attention to. Because PostgreSQL has the concept that \u0026ldquo;partition child tables are also regular tables,\u0026rdquo; this differs from other common databases (Oracle, MySQL). For example, in Oracle you don\u0026rsquo;t need to worry about partition child table permissions, but in PostgreSQL you do.\nPARTITION OF / ATTACH do not inherit the parent table\u0026rsquo;s permissions to child tables:\n-- Grant SELECT on the partitioned table to a regular user =\u0026gt; grant select on lzlpartition1 to userlzl; GRANT -- Check permissions — only the parent table has been granted; existing partition child tables are not automatically granted =\u0026gt; select grantee,table_schema,table_name,privilege_type from information_schema.table_privileges where grantee=\u0026#39;userlzl\u0026#39;; grantee | table_schema | table_name | privilege_type ---------+--------------+---------------+---------------- userlzl | public | lzlpartition1 | SELECT -- Create a partition using PARTITION OF =\u0026gt; create table LZLPARTITION1_202303 partition of LZLPARTITION1 for values from (\u0026#39;2023-03-01 00:00:00\u0026#39;) to (\u0026#39;2023-04-01 00:00:00\u0026#39;); CREATE TABLE -- Create a partition using ATTACH =\u0026gt; CREATE TABLE lzlpartition1_202304 -\u0026gt; (LIKE lzlpartition1 INCLUDING DEFAULTS INCLUDING CONSTRAINTS); CREATE TABLE =\u0026gt; alter table lzlpartition1 attach partition lzlpartition1_202304 for values from (\u0026#39;2023-04-01 00:00:00\u0026#39;) to (\u0026#39;2023-05-01 00:00:00\u0026#39;); ALTER TABLE -- Check permissions again — newly created child partitions are not automatically granted to other users (but permissions are 
automatically granted to the owner) =\u0026gt; select grantee,table_schema,table_name,privilege_type from information_schema.table_privileges where grantee=\u0026#39;userlzl\u0026#39;; grantee | table_schema | table_name | privilege_type ---------+--------------+---------------+---------------- userlzl | public | lzlpartition1 | SELECT At this point, user userlzl has no access permissions to any child tables, but has permissions on the parent table. userlzl can access partition data through the parent table, but cannot access data by directly querying child tables:\n=\u0026gt; \\c - userlzl; You are now connected to database \u0026#34;dbmgr\u0026#34; as user \u0026#34;userlzl\u0026#34;. =\u0026gt; select * from LZLPARTITION1 where date_created=\u0026#39;2023-01-02 10:00:00\u0026#39;; id | name | date_created ------+----------------------------------+--------------------- 2159 | d05d716da126ff4b44d934344cc4dd7a | 2023-01-02 10:00:00 =\u0026gt; select * from LZLPARTITION1_202301 where date_created=\u0026#39;2023-01-02 10:00:00\u0026#39;; ERROR: 42501: permission denied for table lzlpartition1_202301 LOCATION: aclcheck_error, aclchk.c:3466 Since ATTACH/DETACH does not handle permissions, if we DETACH a partition at this point, that partition will also be inaccessible to userlzl:\n=\u0026gt; alter table lzlpartition1 detach partition lzlpartition1_202303; ALTER TABLE =\u0026gt; \\dp+ lzlpartition1_202303; Access privileges Schema | Name | Type | Access privileges | Column privileges | Policies --------+----------------------+-------+-------------------+-------------------+---------- dbmgr | lzlpartition1_202303 | table | | | =\u0026gt; select * from LZLPARTITION1_202301 where date_created=\u0026#39;2023-01-02 10:00:00\u0026#39;; ERROR: 42501: permission denied for table lzlpartition1_202301 From this we can conclude:\nPartition child tables and the parent table exist as regular tables in PostgreSQL, each with their own permission system. 
If you lack child table permissions but have parent table permissions, you can still access child table data through the parent table. PARTITION OF, ATTACH, and DETACH do not propagate or adjust permissions. However, partition table permissions do not merely control whether access is possible. Lacking partition child table permissions can lead to abnormal execution plans. Reference article: Different users see different execution plans This issue is intermittent and causes superusers and regular users to see different SQL execution plans. The actual business SQL execution plan is abnormal but goes unnoticed, making it difficult to diagnose. Partition child tables have their own statistics, and child table permissions are not kept consistent with the parent table (even for partitions created via PARTITION OF), so a user can access child table data through the parent table yet cannot read the child table\u0026rsquo;s statistics. This permission gap leads to differences in execution plans. This contradicts the general concept that \u0026ldquo;permissions only control whether you can access a table, not how you access it,\u0026rdquo; so attention must be paid to this permission issue. To give the user access to child table statistics, it is recommended to explicitly grant SELECT on all child tables to the user, which avoids the issues above:\ngrant select on table_partition_allname to username; Partition Table Maintenance # ATTACH/DETACH Basic Operations # ATTACH adds an existing table to a partitioned table as a partition; DETACH removes a partition from the partitioned table and leaves it as a standalone table. Both are very useful in maintenance work. 
First, let\u0026rsquo;s look at the locking behavior of adding partitions via \u0026ldquo;CREATE TABLE \u0026hellip; PARTITION OF\u0026rdquo; and deleting partitions via \u0026ldquo;DROP TABLE\u0026rdquo;:\nLock Matrix: https://www.postgresql.org/docs/current/explicit-locking.html\nLock Requests: https://postgres-locks.husseinnasser.com\nAdding a partition via PARTITION OF -- Session 1: Start a transaction, read-only data =\u0026gt; select * from lzlpartition1 where date_created=\u0026#39;2023-01-01 00:00:00\u0026#39;; id | name | date_created ------+----------------------------------+--------------------- 8249 | 256ac66bb53d31bc6124294238d6410c | 2023-01-01 00:00:00 -- Session 3: Check lock status. When reading data from one partition, locks are acquired on both the child partition and the parent table. =\u0026gt; select l.locktype,d.datname,r.relname,l.virtualxid,l.transactionid,l.pid,l.mode,l.granted from pg_locks l left join pg_database d on l.database=d.oid left join pg_class r on l.relation=r.oid where relname like \u0026#39;%lzlpartition1%\u0026#39;; locktype | datname | relname | virtualxid | transactionid | pid | mode | granted ----------+---------+----------------------+------------+---------------+--------+-----------------+--------- relation | dbmgr | lzlpartition1_202301 | [null] | [null] | 311449 | AccessShareLock | t relation | dbmgr | lzlpartition1 | [null] | [null] | 311449 | AccessShareLock | t -- Session 2: Add a partition via PARTITION OF =\u0026gt; create table LZLPARTITION1_202305 partition of LZLPARTITION1 for values from (\u0026#39;2023-05-01 00:00:00\u0026#39;) to (\u0026#39;2023-06-01 00:00:00\u0026#39;); -- Waiting -- Session 3: Check locks again =\u0026gt; select l.locktype,d.datname,r.relname,l.virtualxid,l.transactionid,l.pid,l.mode,l.granted from pg_locks l left join pg_database d on l.database=d.oid left join pg_class r on l.relation=r.oid where relname like \u0026#39;%lzlpartition1%\u0026#39;; locktype | datname | relname | 
virtualxid | transactionid | pid | mode | granted ----------+---------+----------------------+------------+---------------+--------+---------------------+--------- relation | dbmgr | lzlpartition1_202301 | [null] | [null] | 311449 | AccessShareLock | t relation | dbmgr | lzlpartition1 | [null] | [null] | 311449 | AccessShareLock | t relation | dbmgr | lzlpartition1 | [null] | [null] | 308525 | AccessExclusiveLock | f -- This is the PARTITION OF session -- Session 4: Run an arbitrary query =\u0026gt; select * from lzlpartition1 where date_created=\u0026#39;2023-01-01 00:00:00\u0026#39;; -- Waiting -- Session 4: Check locks again =\u0026gt; select l.locktype,d.datname,r.relname,l.virtualxid,l.transactionid,l.pid,l.mode,l.granted from pg_locks l left join pg_database d on l.database=d.oid left join pg_class r on l.relation=r.oid where relname like \u0026#39;%lzlpartition1%\u0026#39;; locktype | datname | relname | virtualxid | transactionid | pid | mode | granted ----------+---------+----------------------+------------+---------------+--------+---------------------+--------- relation | dbmgr | lzlpartition1_202301 | [null] | [null] | 311449 | AccessShareLock | t relation | dbmgr | lzlpartition1 | [null] | [null] | 311449 | AccessShareLock | t relation | dbmgr | lzlpartition1 | [null] | [null] | 308525 | AccessExclusiveLock | f relation | dbmgr | lzlpartition1 | [null] | [null] | 84774 | AccessShareLock | f -- Query is blocked When adding a partition via PARTITION OF, an AccessExclusiveLock is requested on the parent table. This waits for all transactions on the parent table and also blocks all transactions on the parent table. Although the PARTITION OF statement itself executes quickly, if there are long-running transactions on the parent table, all operations on the partitioned table will stall for an extended period. 
Without a maintenance window, using PARTITION OF to add partitions directly is not recommended.\nDropping a partition via DROP TABLE -- Session 1: Start another read-only transaction -- Session 2: Drop a child partition of the partitioned table =\u0026gt; drop table lzlpartition1_202305; -- Waiting -- Session 3: Check lock status =\u0026gt; select l.locktype,d.datname,r.relname,l.virtualxid,l.transactionid,l.pid,l.mode,l.granted from pg_locks l left join pg_database d on l.database=d.oid left join pg_class r on l.relation=r.oid where relname like \u0026#39;%lzlpartition1%\u0026#39;; locktype | datname | relname | virtualxid | transactionid | pid | mode | granted ----------+---------+----------------------+------------+---------------+--------+---------------------+--------- relation | dbmgr | lzlpartition1_202301 | [null] | [null] | 311449 | AccessShareLock | t relation | dbmgr | lzlpartition1 | [null] | [null] | 311449 | AccessShareLock | t relation | dbmgr | lzlpartition1 | [null] | [null] | 308525 | AccessExclusiveLock | f Dropping a child partition with DROP TABLE requests an AccessExclusiveLock on the parent table: it waits for every existing transaction on the parent and blocks every later one. Like PARTITION OF, it must be used with caution in production environments.\nATTACH: adding a partition ATTACH attaches an existing regular table to a partitioned table. Although both ATTACH and PARTITION OF can add partitions, they differ in what they set up: ATTACH expects the table to already have matching columns and the parent\u0026rsquo;s NOT NULL and CHECK constraints, and it does not copy column defaults, whereas PARTITION OF derives all of these from the parent. Indexes defined on the parent are created on, or matched to, the attached table automatically.
First, create a table: -- To reduce tedious DDL, use LIKE to create the table CREATE TABLE lzlpartition1_202305 (LIKE lzlpartition1 INCLUDING DEFAULTS INCLUDING CONSTRAINTS); Now observe whether ATTACH is blocked:\n-- Session 1: Start a read-write transaction =\u0026gt;begin; BEGIN =\u0026gt; insert into lzlpartition1 values(\u0026#39;1234\u0026#39;,\u0026#39;abcd\u0026#39;,\u0026#39;2023-01-01 01:00:00\u0026#39;); INSERT 0 1 -- Session 3: Check lock status =\u0026gt; select l.locktype,d.datname,r.relname,l.virtualxid,l.transactionid,l.pid,l.mode,l.granted from pg_locks l left join pg_database d on l.database=d.oid left join pg_class r on l.relation=r.oid where relname like \u0026#39;%lzlpartition1%\u0026#39;; locktype | datname | relname | virtualxid | transactionid | pid | mode | granted ----------+---------+----------------------+------------+---------------+--------+------------------+--------- relation | dbmgr | lzlpartition1_202301 | [null] | [null] | 311449 | RowExclusiveLock | t relation | dbmgr | lzlpartition1 | [null] | [null] | 311449 | AccessShareLock | t relation | dbmgr | lzlpartition1 | [null] | [null] | 311449 | RowExclusiveLock | t -- DML statements acquire RowExclusiveLock on the partition parent table and the corresponding partition child table -- Session 2: ATTACH the newly created table to the partition parent table =\u0026gt; alter table lzlpartition1 attach partition lzlpartition1_202305 for values from (\u0026#39;2023-05-01 00:00:00\u0026#39;) to (\u0026#39;2023-06-01 00:00:00\u0026#39;); ALTER TABLE ATTACH only requests a SHARE UPDATE EXCLUSIVE lock, which is much lighter than ACCESS EXCLUSIVE. 
ATTACH does not block reads or writes, so ATTACH is recommended for adding partitions — it does not affect business operations and can be executed online.\nDETACH — removing a partition DETACH removes a partition from the partitioned table, turning it into a regular table:\n-- Session 1: Keep the DML transaction uncommitted -- Session 2: DETACH a partition alter table lzlpartition1 detach partition lzlpartition1_202305; -- Waiting -- Session 3: Check lock status =\u0026gt; select l.locktype,d.datname,r.relname,l.virtualxid,l.transactionid,l.pid,l.mode,l.granted from pg_locks l left join pg_database d on l.database=d.oid left join pg_class r on l.relation=r.oid where relname like \u0026#39;%lzlpartition1%\u0026#39;; locktype | datname | relname | virtualxid | transactionid | pid | mode | granted ----------+---------+----------------------+------------+---------------+--------+---------------------+--------- relation | dbmgr | lzlpartition1_202301 | [null] | [null] | 311449 | RowExclusiveLock | t relation | dbmgr | lzlpartition1 | [null] | [null] | 311449 | RowExclusiveLock | t relation | dbmgr | lzlpartition1 | [null] | [null] | 308525 | AccessShareLock | t relation | dbmgr | lzlpartition1 | [null] | [null] | 308525 | AccessExclusiveLock | f Unlike ATTACH, DETACH requests an AccessExclusiveLock on the parent table, waiting for all and blocking all.\nDETACH CONCURRENTLY\nStarting from PostgreSQL 14, DETACH gained two new syntax variants: CONCURRENTLY and FINALIZE.\nALTER TABLE [ IF EXISTS ] name DETACH PARTITION partition_name [ CONCURRENTLY | FINALIZE ]\nDETACH CONCURRENTLY internally starts two transactions. The first transaction requests a SHARE UPDATE EXCLUSIVE lock on both the parent and child tables, marking the partition as being in a detaching state, at which point it waits for all transactions on the partitioned table to commit. 
Once all those transactions have committed, the second transaction requests a SHARE UPDATE EXCLUSIVE lock on the parent table and an ACCESS EXCLUSIVE lock on that child table, after which DETACH CONCURRENTLY completes.\nAdditionally, after DETACH CONCURRENTLY, the detached child table retains its constraint — the partition constraint is converted into a CHECK constraint on the detached table.\nDETACH CONCURRENTLY limitations:\nDETACH CONCURRENTLY cannot be placed inside a transaction block. The partitioned table cannot have a DEFAULT partition. Locking behavior of CONCURRENTLY:\n-- Session 1 lzldb=\u0026gt; begin; BEGIN lzldb=*\u0026gt; insert into lzlpartition1 values(\u0026#39;1234\u0026#39;,\u0026#39;abcd\u0026#39;,\u0026#39;2023-01-01 01:00:00\u0026#39;); INSERT 0 1 -- Session 2: DETACH CONCURRENTLY lzldb=\u0026gt; alter table lzlpartition1 detach partition lzlpartition1_202301 concurrently; -- Waiting -- Session 3: Check locks 3691 | insert into lzlpartition1 values(\u0026#39;1234\u0026#39;,\u0026#39;abcd\u0026#39;,\u0026#39;2023-01-01 01:00:00\u0026#39;); | Client | ClientRead 3940 | alter table lzlpartition1 detach partition lzlpartition1_202301 concurrently; | Lock | virtualxid 3947 | select pid,query,wait_event_type,wait_event from pg_stat_activity; | | -- The DETACH session is 3940. Interestingly, the DETACH wait event is virtualxid, and the wait event type is Lock. 
-- Check lock details lzldb=\u0026gt; select locktype,database,relation,virtualtransaction,pid,mode,granted from pg_locks where pid in (3691,3940); locktype | database | relation | virtualtransaction | pid | mode | granted ---------------+----------+----------+--------------------+------+------------------+--------- virtualxid | | | 6/9 | 3940 | ExclusiveLock | t relation | 16387 | 40969 | 5/179 | 3691 | RowExclusiveLock | t relation | 16387 | 40963 | 5/179 | 3691 | RowExclusiveLock | t virtualxid | | | 5/179 | 3691 | ExclusiveLock | t virtualxid | | | 6/9 | 3940 | ShareLock | f transactionid | | | 5/179 | 3691 | ExclusiveLock | t -- At this point, DETACH is not yet waiting for a table-level lock; it is waiting for a ShareLock on virtualxid -- Session 4: Try an insert lzldb=\u0026gt; insert into lzlpartition1 values(\u0026#39;12345\u0026#39;,\u0026#39;abcd\u0026#39;,\u0026#39;2023-01-01 01:00:00\u0026#39;); ERROR: no partition of relation \u0026#34;lzlpartition1\u0026#34; found for row DETAIL: Partition key of the failing row contains (date_created) = (2023-01-01 01:00:00). lzldb=\u0026gt; insert into lzlpartition1 values(\u0026#39;12345\u0026#39;,\u0026#39;abcd\u0026#39;,\u0026#39;2023-02-01 01:00:00\u0026#39;); INSERT 0 1 -- The detaching partition can no longer accept inserts, but other partitions can. -- What if we insert directly into the partition? It works fine. lzldb=\u0026gt; insert into lzlpartition1_202301 values(\u0026#39;12345\u0026#39;,\u0026#39;abcd\u0026#39;,\u0026#39;2023-01-01 01:00:00\u0026#39;); INSERT 0 1 -- Note: at this point it is still a partition of the partitioned table, not yet a regular table, but it has been marked as unavailable. 
-- \\d+ shows the partition in DETACH PENDING state Partitions: lzlpartition1_202301 FOR VALUES FROM (\u0026#39;2023-01-01 00:00:00\u0026#39;) TO (\u0026#39;2023-02-01 00:00:00\u0026#39;) (DETACH PENDING), lzlpartition1_202302 FOR VALUES FROM (\u0026#39;2023-02-01 00:00:00\u0026#39;) TO (\u0026#39;2023-03-01 00:00:00\u0026#39;) -- Commit/rollback the insert session (Session 1) lzldb=\u0026gt; rollback; ROLLBACK -- Session 2 completes immediately lzldb=\u0026gt; alter table lzlpartition1 detach partition lzlpartition1_202301 concurrently; ALTER TABLE FINALIZE:\n-- Session 1 lzldb=\u0026gt; begin; BEGIN lzldb=*\u0026gt; insert into lzlpartition1 values(\u0026#39;1234\u0026#39;,\u0026#39;abcd\u0026#39;,\u0026#39;2023-01-01 01:00:00\u0026#39;); INSERT 0 1 -- Session 2: DETACH CONCURRENTLY, manually canceled lzldb=\u0026gt; alter table lzlpartition1 detach partition lzlpartition1_202301 concurrently; ^CCancel request sent ERROR: canceling statement due to user request -- \\d+ shows the partition in DETACH PENDING state Partitions: lzlpartition1_202301 FOR VALUES FROM (\u0026#39;2023-01-01 00:00:00\u0026#39;) TO (\u0026#39;2023-02-01 00:00:00\u0026#39;) (DETACH PENDING), lzlpartition1_202302 FOR VALUES FROM (\u0026#39;2023-02-01 00:00:00\u0026#39;) TO (\u0026#39;2023-03-01 00:00:00\u0026#39;) -- In DETACH PENDING state, SQL no longer accesses this partition lzldb=\u0026gt; explain select * from lzlpartition1; QUERY PLAN ----------------------------------------------------------------------------------------- Seq Scan on lzlpartition1_202302 lzlpartition1 (cost=0.00..752.81 rows=38881 width=45) -- Use FINALIZE to complete the detach lzldb=\u0026gt; alter table lzlpartition1 detach partition lzlpartition1_202301 finalize; -- Waiting -- Check lock status lzldb=\u0026gt; select l.locktype,d.datname,r.relname,l.virtualxid,l.transactionid,l.pid,l.mode,l.granted from pg_locks l left join pg_database d on l.database=d.oid left join pg_class r on l.relation=r.oid where relname 
like \u0026#39;%lzlpartition1%\u0026#39;; locktype | datname | relname | virtualxid | transactionid | pid | mode | granted ----------+---------+----------------------+------------+---------------+------+--------------------------+--------- relation | lzldb | lzlpartition1 | | | 3691 | RowExclusiveLock | t relation | lzldb | lzlpartition1_202301 | | | 3940 | AccessExclusiveLock | f relation | lzldb | lzlpartition1 | | | 3940 | ShareUpdateExclusiveLock | t relation | lzldb | lzlpartition1_202301 | | | 3691 | RowExclusiveLock | t -- 3940, FINALIZE requests ShareUpdateExclusiveLock on the parent table and AccessExclusiveLock on the child table -- Since the inserted data happened to be in the detaching partition, it is waiting -- Session 1 ends lzldb=!\u0026gt; rollback; ROLLBACK -- Session 2 completes immediately lzldb=\u0026gt; alter table lzlpartition1 detach partition lzlpartition1_202301 finalize; ALTER TABLE Although DETACH CONCURRENTLY / FINALIZE requests an AccessExclusiveLock (the strongest of the eight table-level lock modes) on the partition being detached, business operations generally don\u0026rsquo;t write directly to child partitions, so you only need to ensure that long-running transactions on the partitioned table complete quickly. Usually, there\u0026rsquo;s no need to worry about subsequent blocking on that child table.\nOnline DETACH summary:\nThe blocking behavior of DETACH CONCURRENTLY is somewhat similar to CIC (CREATE INDEX CONCURRENTLY): it does not block other transactions, but it waits for existing transactions to complete. This is not easily visible from table-level lock information alone, because the wait is on a virtualxid. During DETACH CONCURRENTLY, the partition enters a DETACH PENDING intermediate state. This state is somewhat like INVISIBLE: SQL issued against the parent table will not find this partition. If DETACH PENDING is caused by long-running transactions, promptly end those transactions; if it\u0026rsquo;s caused by an interruption, use FINALIZE to complete the detach.
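As a minimal sketch of how a partition left in DETACH PENDING could be found and finished: the query below assumes PostgreSQL 14 or later, where the pg_inherits catalog has the inhdetachpending column. The table names are the ones used in the sessions above.

```sql
-- List partitions currently marked DETACH PENDING (PostgreSQL 14+).
SELECT inhparent::regclass AS parent_table,
       inhrelid::regclass  AS pending_partition
FROM   pg_inherits
WHERE  inhdetachpending;

-- Complete an interrupted detach for one of them.
-- This still waits for open transactions that hold locks on that partition.
ALTER TABLE lzlpartition1 DETACH PARTITION lzlpartition1_202301 FINALIZE;
```

Running the first query periodically is one way to notice a canceled DETACH CONCURRENTLY before anyone wonders why rows for that range can no longer be inserted through the parent table.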
Using Constraints to Reduce ATTACH Time # Partition data overview — prepare to ATTACH a relatively large partition: =\u0026gt; SELECT tableoid::regclass AS partition, count(*) FROM lzlpartition1 group by partition; partition | count ----------------------+--------- lzlpartition1_202301 | 2592001 lzlpartition1_202302 | 38881 Note: this 202301 partition has a PARTITION CONSTRAINT:\n=\u0026gt; \\d+ lzlpartition1_202301 Table \u0026#34;public.lzlpartition1_202301\u0026#34; Column | Type | Collation | Nullable | Default | Storage | Stats target | Description --------------+-----------------------------+-----------+----------+---------+----------+--------------+------------- id | integer | | not null | | plain | | name | character varying(50) | | | | extended | | date_created | timestamp without time zone | | not null | now() | plain | | Partition of: lzlpartition1 FOR VALUES FROM (\u0026#39;2023-01-01 00:00:00\u0026#39;) TO (\u0026#39;2023-02-01 00:00:00\u0026#39;) Partition constraint: ((date_created IS NOT NULL) AND (date_created \u0026gt;= \u0026#39;2023-01-01 00:00:00\u0026#39;::timestamp without time zone) AND (date_created \u0026lt; \u0026#39;2023-02-01 00:00:00\u0026#39;::timestamp without t Indexes: \u0026#34;lzlpartition1_202301_pkey\u0026#34; PRIMARY KEY, btree (id, date_created) Access method: heap DETACH the partition: alter table lzlpartition1 detach partition lzlpartition1_202301; -- After DETACH, the PARTITION CONSTRAINT is gone =\u0026gt; \\d+ lzlpartition1_202301 Table \u0026#34;public.lzlpartition1_202301\u0026#34; Column | Type | Collation | Nullable | Default | Storage | Stats target | Description --------------+-----------------------------+-----------+----------+---------+----------+--------------+------------- id | integer | | not null | | plain | | name | character varying(50) | | | | extended | | date_created | timestamp without time zone | | not null | now() | plain | | Indexes: \u0026#34;lzlpartition1_202301_pkey\u0026#34; PRIMARY KEY, btree 
(id, date_created) Access method: heap ATTACH without adding a CHECK constraint: =\u0026gt; alter table lzlpartition1 attach partition lzlpartition1_202301 for values from (\u0026#39;2023-01-01 00:00:00\u0026#39;) to (\u0026#39;2023-02-01 00:00:00\u0026#39;); ALTER TABLE Time: 343.498 ms Because it must scan the partition data to verify it satisfies the partition range, ATTACH took 300+ ms.\nAdd a CHECK constraint first, then ATTACH: =\u0026gt; alter table lzlpartition1 detach partition lzlpartition1_202301; =\u0026gt; alter table lzlpartition1_202301 add constraint chk_202301 CHECK -\u0026gt; ((date_created IS NOT NULL) AND (date_created \u0026gt;= \u0026#39;2023-01-01 00:00:00\u0026#39;::timestamp without time zone) AND (date_created \u0026lt; \u0026#39;2023-02-01 00:00:00\u0026#39;::timestamp without time zone)); ALTER TABLE Time: 355.458 ms The time taken to add the CHECK constraint is roughly the same as the ATTACH operation without a CHECK — because adding a CHECK constraint also needs to scan and validate all data. 
Once the CHECK constraint is added, the subsequent ATTACH completes very quickly:\n=\u0026gt; alter table lzlpartition1 attach partition lzlpartition1_202301 for values from (\u0026#39;2023-01-01 00:00:00\u0026#39;) to (\u0026#39;2023-02-01 00:00:00\u0026#39;); ALTER TABLE Time: 1.480 ms Drop the CHECK constraint: =\u0026gt; \\d+ lzlpartition1_202301; Table \u0026#34;public.lzlpartition1_202301\u0026#34; Column | Type | Collation | Nullable | Default | Storage | Stats target | Description --------------+-----------------------------+-----------+----------+---------+----------+--------------+------------- id | integer | | not null | | plain | | name | character varying(50) | | | | extended | | date_created | timestamp without time zone | | not null | now() | plain | | Partition of: lzlpartition1 FOR VALUES FROM (\u0026#39;2023-01-01 00:00:00\u0026#39;) TO (\u0026#39;2023-02-01 00:00:00\u0026#39;) Partition constraint: ((date_created IS NOT NULL) AND (date_created \u0026gt;= \u0026#39;2023-01-01 00:00:00\u0026#39;::timestamp without time zone) AND (date_created \u0026lt; \u0026#39;2023-02-01 00:00:00\u0026#39;::timestamp without t Indexes: \u0026#34;lzlpartition1_202301_pkey\u0026#34; PRIMARY KEY, btree (id, date_created) Check constraints: \u0026#34;chk_202301\u0026#34; CHECK (date_created IS NOT NULL AND date_created \u0026gt;= \u0026#39;2023-01-01 00:00:00\u0026#39;::timestamp without time zone AND date_created \u0026lt; \u0026#39;2023-02-01 00:00:00\u0026#39;::timestamp without time Access method: heap Note: CHECK CONSTRAINT and PARTITION CONSTRAINT are different concepts, even though their constraint content can be identical. ATTACH uses the CHECK constraint but does not merge it. 
You can explicitly drop this redundant CHECK:\n=\u0026gt; alter table lzlpartition1_202301 drop constraint chk_202301; ALTER TABLE Additionally, note that DROP CONSTRAINT requests an AccessExclusiveLock on the child partition. This is the strongest lock mode and conflicts with every other mode, so if there are transactions on that child partition, be cautious with DROP CONSTRAINT.\n=\u0026gt; select l.locktype,d.datname,r.relname,l.virtualxid,l.transactionid,l.pid,l.mode,l.granted from pg_locks l left join pg_database d on l.database=d.oid left join pg_class r on l.relation=r.oid where relname like \u0026#39;%lzlpartition1%\u0026#39;; locktype | datname | relname | virtualxid | transactionid | pid | mode | granted ----------+---------+----------------------+------------+---------------+--------+---------------------+--------- relation | dbmgr | lzlpartition1 | [null] | [null] | 448243 | AccessShareLock | t relation | dbmgr | lzlpartition1 | [null] | [null] | 448243 | RowExclusiveLock | t relation | dbmgr | lzlpartition1_202301 | [null] | [null] | 444399 | AccessShareLock | t relation | dbmgr | lzlpartition1_202301 | [null] | [null] | 444399 | AccessExclusiveLock | f -- This is the DROP CONSTRAINT session relation | dbmgr | lzlpartition1_202301 | [null] | [null] | 448243 | RowExclusiveLock | t In short: when attaching a partition, adding a CHECK constraint beforehand is useful because it shortens the ATTACH itself. The data validation still has to happen; it is simply completed before the ATTACH.\nThe Correct Way to Add Partitions to a Partitioned Table # We now know that ATTACH can be executed online, while PARTITION OF / DROP TABLE / DETACH all request an AccessExclusiveLock that waits for and blocks everything. The recommendations are: Use ATTACH to create new partitions. PARTITION OF / DETACH both wait for and block all transactions, while ATTACH is not blocked by read-only/DML transactions. Before the ATTACH, create a CHECK constraint that matches the partition bounds.
When dropping constraints, be mindful of long-running transactions. The correct way to add a partition to a partitioned table:\n-- To reduce tedious DDL, use LIKE to create the table CREATE TABLE lzlpartition1_202303 (LIKE lzlpartition1 INCLUDING DEFAULTS INCLUDING CONSTRAINTS); -- Refer to the PARTITION CONSTRAINT of other partitions, add a CHECK constraint on the table to reduce ATTACH constraint validation time alter table lzlpartition1_202303 add constraint chk_202303 CHECK ((date_created IS NOT NULL) AND (date_created \u0026gt;= \u0026#39;2023-03-01 00:00:00\u0026#39;::timestamp without time zone) AND (date_created \u0026lt; \u0026#39;2023-04-01 00:00:00\u0026#39;::timestamp without time zone)); -- Add partition using ATTACH alter table LZLPARTITION1 attach partition LZLPARTITION1_202303 for values from (\u0026#39;2023-03-01 00:00:00\u0026#39;) to (\u0026#39;2023-04-01 00:00:00\u0026#39;); -- Optional. Drop the redundant CHECK constraint before transactions start on the new partition alter table lzlpartition1_202303 drop constraint chk_202303; Locks on Partition Indexes # Creating/dropping partition indexes during read-only transactions When a query transaction on the partitioned table holds a shared lock (AccessShareLock) on a partition: CREATE INDEX on lzlpartition1 succeeds (note: without CONCURRENTLY); DROP INDEX on that index does not fail but waits until the query transaction ends:\n-- Session 1: Start a transaction, read data from the partitioned table =\u0026gt; begin; BEGIN =\u0026gt; select count(*) from lzlpartition1 where date_created\u0026gt;=\u0026#39;2023-01-01 00:00:00\u0026#39; and date_created\u0026lt;=\u0026#39;2023-01-02 00:00:00\u0026#39;; count ------- 86401 (1 row) -- Session 2: Create index, succeeds =\u0026gt; create index idx_datecreated on lzlpartition1(date_created); CREATE INDEX -- Session 2: Drop index, waits =\u0026gt; drop index idx_datecreated; -- Session 3: Check locks =\u0026gt; select
l.locktype,d.datname,r.relname,l.virtualxid,l.transactionid,l.pid,l.mode,l.granted from pg_locks l left join pg_database d on l.database=d.oid left join pg_class r on l.relation=r.oid where relname like \u0026#39;%lzlpartition1%\u0026#39;; locktype | datname | relname | virtualxid | transactionid | pid | mode | granted ----------+---------+---------------------------+------------+---------------+--------+---------------------+--------- relation | dbmgr | lzlpartition1_202301_pkey | [null] | [null] | 300371 | AccessShareLock | t relation | dbmgr | lzlpartition1 | [null] | [null] | 99598 | AccessExclusiveLock | f relation | dbmgr | lzlpartition1_202301 | [null] | [null] | 300371 | AccessShareLock | t relation | dbmgr | lzlpartition1 | [null] | [null] | 300371 | AccessShareLock | t CREATE INDEX does not request an AccessExclusiveLock on the table, but DROP INDEX does. From this example we can conclude: Read-only transactions do not block CREATE INDEX, but they do block DROP INDEX.\nCreating/dropping partition indexes during update transactions -- Session 1: Start an update transaction =\u0026gt; begin; BEGIN =\u0026gt; update lzlpartition1 set name=\u0026#39;abc\u0026#39; where date_created=\u0026#39;2023-01-01 10:00:00\u0026#39;; UPDATE 1 -- Session 2: Create partition index, waits =\u0026gt; create index idx_datecreated on lzlpartition1(date_created); -- Session 3: Check lock status =\u0026gt;select l.locktype,d.datname,r.relname,l.virtualxid,l.transactionid,l.pid,l.mode,l.granted from pg_locks l left join pg_database d on l.database=d.oid left join pg_class r on l.relation=r.oid -\u0026gt; where relname like \u0026#39;%lzlpartition1%\u0026#39;; locktype | datname | relname | virtualxid | transactionid | pid | mode | granted ----------+---------+---------------------------+------------+---------------+--------+------------------+--------- relation | dbmgr | lzlpartition1_202301_pkey | [null] | [null] | 300371 | RowExclusiveLock | t relation | dbmgr | lzlpartition1 | 
[null] | [null] | 99598 | ShareLock | f relation | dbmgr | lzlpartition1_202301 | [null] | [null] | 300371 | RowExclusiveLock | t relation | dbmgr | lzlpartition1 | [null] | [null] | 300371 | AccessShareLock | t relation | dbmgr | lzlpartition1 | [null] | [null] | 300371 | RowExclusiveLock | t The CREATE INDEX session (99598) requests a ShareLock on the partition parent table; the DML transaction session (300371) holds RowExclusiveLock on the child partition and parent table. CREATE INDEX (without CONCURRENTLY) requests ShareLock on the parent table; Read-only transactions request AccessShareLock on the parent and child tables; Update transactions request RowExclusiveLock on the parent and child tables; ==\u0026gt; AccessShareLock does not block ShareLock, so queries do not block CREATE INDEX (without CONCURRENTLY); RowExclusiveLock blocks ShareLock, so DML blocks CREATE INDEX (without CONCURRENTLY);\nCreating partitioned indexes with CONCURRENTLY Note: You cannot create indexes with CONCURRENTLY on a partitioned table.\n=\u0026gt; create index concurrently idx_datecreated on lzlpartition1(date_created); ERROR: 0A000: cannot create index on partitioned table \u0026#34;lzlpartition1\u0026#34; concurrently LOCATION: DefineIndex, indexcmds.c:665 There is a patch at https://commitfest.postgresql.org/35/2815/ working on solving this issue.\nCurrently, you can create indexes with CONCURRENTLY on individual partition child tables:\n-- Session 1: Still using the previous DML transaction -- Session 2: Create index with CONCURRENTLY on a child table, waits =\u0026gt; create index concurrently idx_datecreated_202301 on lzlpartition1_202301(date_created); -- Session 3: Check lock status =\u0026gt; select l.locktype,d.datname,r.relname,l.virtualxid,l.transactionid,l.pid,l.mode,l.granted from pg_locks l left join pg_database d on l.database=d.oid left join pg_class r on l.relation=r.oid where relname like \u0026#39;%lzlpartition1%\u0026#39;; locktype | datname | relname | 
virtualxid | transactionid | pid | mode | granted ----------+---------+---------------------------+------------+---------------+--------+--------------------------+--------- relation | dbmgr | lzlpartition1_202301_pkey | [null] | [null] | 300371 | RowExclusiveLock | t relation | dbmgr | lzlpartition1_202301 | [null] | [null] | 99598 | ShareUpdateExclusiveLock | t relation | dbmgr | lzlpartition1_202301 | [null] | [null] | 300371 | RowExclusiveLock | t relation | dbmgr | lzlpartition1 | [null] | [null] | 300371 | AccessShareLock | t relation | dbmgr | lzlpartition1 | [null] | [null] | 300371 | RowExclusiveLock | t With CONCURRENTLY, the requested lock is one level lower and no longer conflicts with ROW EXCL. The locks don\u0026rsquo;t conflict, so why is CONCURRENTLY itself still blocked?\nit must wait for all existing transactions that could potentially modify or use the index to terminate.\nThe official documentation explains that CONCURRENTLY needs to wait for transactions that could potentially modify or use the index to terminate. In our case, the UPDATE statement modified the indexed column, so CONCURRENTLY needs to wait for it to complete. Although CONCURRENTLY itself hasn\u0026rsquo;t completed due to the prior DML statement, there\u0026rsquo;s a benefit: CONCURRENTLY does not block subsequent DML statements.\n-- While CONCURRENTLY has not yet completed -- Session 4: Update a record =\u0026gt; update lzlpartition1 set name=\u0026#39;abc\u0026#39; where date_created=\u0026#39;2023-01-01 12:00:00\u0026#39;; UPDATE 1 Summary of partition index locking issues:\nLocking for read-only/read-write/index creation on partitioned tables is similar to regular tables. Just note that transactions acquire locks on both the partition parent table and child tables, so when subsequent blocking chains involve heavier locks, all partitions are affected. Read-only transactions do not block CREATE INDEX, but they do block DROP INDEX. 
DML blocks CREATE INDEX and also blocks CREATE INDEX CONCURRENTLY, but CONCURRENTLY does not block DML. Although CREATE INDEX on a partitioned table automatically creates indexes on all existing and future partitions, it is not recommended for direct use in production due to blocking issues. You cannot use CONCURRENTLY directly on the partition parent table, so you need to create indexes with CONCURRENTLY on each partition child table. CONCURRENTLY does not block subsequent transactions but itself gets blocked by prior long-running transactions and may cause the created index to be invalid. Attention must be paid to long-running transactions. The Correct Way to Create Partition Indexes # Although you cannot create indexes with CONCURRENTLY on a partitioned table, you can create indexes with CONCURRENTLY on partition child tables using the following syntax: CREATE INDEX ON ONLY : Creates an invalid index on the parent table; does not automatically create indexes on child partitions. CREATE INDEX CONCURRENTLY : Creates an index with CONCURRENTLY on a child partition. ALTER INDEX .. ATTACH PARTITION : Attaches the partition index to the parent index. After all child partition indexes have been attached, the partition parent table index is automatically marked as valid. 
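The per-partition step in the list above gets tedious with many partitions. One possible way to generate the statements is sketched here, assuming a single level of partitioning under lzlpartition1. The index naming scheme (partition name plus _date_created_idx) is hypothetical and not taken from the article.

```sql
-- Generate one CREATE INDEX CONCURRENTLY statement per direct partition.
-- Assumption: index names follow <partition>_date_created_idx (hypothetical).
SELECT format(
         'CREATE INDEX CONCURRENTLY %I ON %s (date_created);',
         c.relname || '_date_created_idx',
         c.oid::regclass
       ) AS ddl
FROM   pg_inherits i
JOIN   pg_class    c ON c.oid = i.inhrelid
WHERE  i.inhparent = 'lzlpartition1'::regclass
ORDER  BY c.relname;

-- After the builds: list indexes on the parent and its partitions with their validity.
-- A canceled or failed CONCURRENTLY build leaves an index with indisvalid = false.
SELECT indexrelid::regclass AS index_name,
       indrelid::regclass   AS table_name,
       indisvalid
FROM   pg_index
WHERE  indrelid = 'lzlpartition1'::regclass
   OR  indrelid IN (SELECT inhrelid
                    FROM   pg_inherits
                    WHERE  inhparent = 'lzlpartition1'::regclass);
```

The generated statements have to run outside a transaction block, one at a time. An invalid child index should be dropped and rebuilt before it is attached to the parent index.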
However, when executing these commands, you still need to pay attention to locking behavior.\nBelow, observe the lock requests and blocking for the above two statements: (DML explicit transaction in Session 1 is kept open throughout)\nBlocking behavior of CREATE INDEX ON ONLY: =\u0026gt; CREATE INDEX IDX_DATECREATED ON ONLY lzlpartition1(date_created); -- Waiting -- Check lock status locktype | datname | relname | virtualxid | transactionid | pid | mode | granted ----------+---------+----------------------+------------+---------------+--------+------------------+--------- relation | dbmgr | lzlpartition1_202301 | [null] | [null] | 448243 | RowExclusiveLock | t relation | dbmgr | lzlpartition1 | [null] | [null] | 448243 | RowExclusiveLock | t relation | dbmgr | lzlpartition1 | [null] | [null] | 444399 | ShareLock | f CREATE INDEX ON ONLY requests a ShareLock. ShareLock and RowExclusiveLock block each other. So, although ONLY itself executes very quickly, CREATE INDEX ON ONLY should not be used casually either.\n-- After the DML transaction ends, CREATE INDEX ON ONLY completes \u0026#34;idx_datecreated\u0026#34; btree (date_created) INVALID CREATE INDEX ON ONLY creates an invalid index on the partition parent table and does not create indexes on child partitions.\nBlocking behavior of ATTACH index: -- After ONLY index creation completes, start another DML explicit transaction in Session 1 =\u0026gt; begin; BEGIN =\u0026gt; insert into lzlpartition1 values(\u0026#39;1111\u0026#39;,\u0026#39;abc\u0026#39;,\u0026#39;2023-01-01 00:00:00\u0026#39;); INSERT 0 1 -- Session 2: Create index with CONCURRENTLY on child partition =\u0026gt; create index concurrently idx_datecreated_202302 on lzlpartition1_202302(date_created); CREATE INDEX -- 202302 partition index created =\u0026gt; create index concurrently idx_datecreated_202304 on lzlpartition1_202304(date_created); CREATE INDEX -- 202304 partition index created =\u0026gt; create index concurrently idx_datecreated_202301 on 
lzlpartition1_202301(date_created); ---- Creating 202301 partition index, waiting CONCURRENTLY waits for transactions that might use the index to complete. Our explicit transaction only inserted into the 202301 partition, so only this partition\u0026rsquo;s CONCURRENTLY index creation hasn\u0026rsquo;t completed.\n-- Complete the DML explicit transaction in Session 1, wait for the index to finish, then start another transaction =\u0026gt; commit; COMMIT =\u0026gt; begin; BEGIN =\u0026gt; insert into lzlpartition1 values(\u0026#39;1111\u0026#39;,\u0026#39;abc\u0026#39;,\u0026#39;2023-01-01 00:00:01\u0026#39;); INSERT 0 1 -- Session 2: ATTACH index =\u0026gt; ALTER INDEX idx_datecreated ATTACH PARTITION idx_datecreated_202302; ALTER INDEX -- ATTACH successful =\u0026gt; \\d+ idx_datecreated Partitioned index \u0026#34;public.idx_datecreated\u0026#34; Column | Type | Key? | Definition | Storage | Stats target --------------+-----------------------------+------+--------------+---------+-------------- date_created | timestamp without time zone | yes | date_created | plain | btree, for table \u0026#34;public.lzlpartition1\u0026#34;, invalid Partitions: idx_datecreated_202302 -- 202302 child partition index has been attached, index still invalid Access method: btree -- Attach the remaining child partition indexes =\u0026gt; ALTER INDEX idx_datecreated ATTACH PARTITION idx_datecreated_202301; ALTER INDEX -- ATTACH successful =\u0026gt; ALTER INDEX idx_datecreated ATTACH PARTITION idx_datecreated_202304; ALTER INDEX -- ATTACH successful -- After all child partition indexes are attached, the parent table index automatically becomes valid =\u0026gt; \\d+ idx_datecreated Partitioned index \u0026#34;public.idx_datecreated\u0026#34; Column | Type | Key? 
| Definition | Storage | Stats target --------------+-----------------------------+------+--------------+---------+-------------- date_created | timestamp without time zone | yes | date_created | plain | btree, for table \u0026#34;public.lzlpartition1\u0026#34; Partitions: idx_datecreated_202301, idx_datecreated_202302, idx_datecreated_202304 Access method: btree ATTACH is not blocked by DML and completes immediately. At this point, new partitions created via PARTITION OF will also automatically get the child partition index.\nIn summary,\nCREATE INDEX ON ONLY requests a ShareLock, which mutually blocks with the RowExclusiveLock requested by DML. CREATE INDEX CONCURRENTLY requests a ShareUpdateExclusiveLock, which does not block the RowExclusiveLock requested by DML. However, CREATE INDEX CONCURRENTLY needs to wait for DML transactions to complete before it can finish (CONCURRENTLY can acquire the lock but cannot complete). ALTER INDEX .. ATTACH PARTITION requests an AccessShareLock, which is the lightest lock and does not block the RowExclusiveLock requested by DML. Queries request AccessShareLock, the lightest lock. Unless DDL requests AccessExclusiveLock (the heaviest lock), blocking does not occur. Therefore, directly running CREATE INDEX on a partition blocks DML and is not acceptable. The correct way to create partition indexes:\n-- Use ONLY to create an invalid index on the partition parent table. Fast, but blocks subsequent DML, affects business — watch for long-running transactions. CREATE INDEX IDX_DATECREATED ON ONLY lzlpartition1(date_created); -- Use CONCURRENTLY to create indexes on each partition child table. Slow, does not block subsequent DML, does not affect business, but watch for long-running DML transactions to prevent failure. create index concurrently idx_datecreated_202302 on lzlpartition1_202302(date_created); -- ATTACH all indexes. Fast, does not cause business blocking. 
ALTER INDEX idx_datecreated ATTACH PARTITION idx_datecreated_202302; Adding Primary Keys and Unique Indexes to Partitioned Tables # A \u0026ldquo;primary key index\u0026rdquo; is functionally equivalent to \u0026ldquo;unique index + NOT NULL constraint\u0026rdquo; (but there can only be one primary key). Creating unique indexes on partitioned tables can follow the index creation best practices above: ONLY on parent, CONCURRENTLY on children, ATTACH. However, while primary keys on regular tables support the USING INDEX syntax, partitioned tables currently do not support this:\n=\u0026gt; ALTER TABLE lzlpartition1 ADD CONSTRAINT pk_id_date_created PRIMARY KEY USING INDEX idx_uniq; ERROR: 0A000: ALTER TABLE / ADD CONSTRAINT USING INDEX is not supported on partitioned tables LOCATION: ATExecAddIndexConstraint, tablecmds.c:8032 In other words, you can create a NOT NULL unique index by pre-creating a NOT NULL constraint + ATTACH-ing indexes, but the final step of USING INDEX to add the primary key does not work.\nNow let\u0026rsquo;s look at the blocking behavior of directly adding/dropping primary keys:\nDirectly dropping a primary key: -- Session 1 =\u0026gt; begin; BEGIN Time: 0.318 ms =\u0026gt; select * from lzlpartition1 where date_created=\u0026#39;2023-01-01 22:00:00\u0026#39;; id | name | date_created ------+----------------------------------+--------------------- 7715 | beee680a86e1d12790489e9ab4a4351b | 2023-01-01 22:00:00 -- Session 2: Drop primary key, waits =\u0026gt; alter table lzlpartition1 drop constraint lzlpartition1_pkey; -- Session 3: Observe =\u0026gt; select l.locktype,d.datname,r.relname,l.virtualxid,l.transactionid,l.pid,l.mode,l.granted from pg_locks l left join pg_database d on l.database=d.oid left join pg_class r on l.relation=r.oid where relname like \u0026#39;%lzlpartition1%\u0026#39;; locktype | datname | relname | virtualxid | transactionid | pid | mode | granted 
----------+---------+---------------------------+------------+---------------+-------+---------------------+--------- relation | dbmgr | lzlpartition1_202301_pkey | [null] | [null] | 21659 | AccessShareLock | t relation | dbmgr | lzlpartition1_202301 | [null] | [null] | 21659 | AccessShareLock | t relation | dbmgr | lzlpartition1 | [null] | [null] | 95016 | AccessShareLock | t relation | dbmgr | lzlpartition1 | [null] | [null] | 95016 | AccessExclusiveLock | f relation | dbmgr | lzlpartition1 | [null] | [null] | 21659 | AccessShareLock | t Dropping a primary key requests an AccessExclusiveLock, blocking everything.\nDirectly adding a primary key: -- Session 1 transaction ends; Session 2\u0026#39;s drop primary key completes -- Session 1 starts another read-only transaction -- Session 2: Add a primary key on the partitioned table, waits =\u0026gt; ALTER TABLE lzlpartition1 ADD PRIMARY KEY(id, date_created); -- Session 3: Observe locks =\u0026gt; select l.locktype,d.datname,r.relname,l.virtualxid,l.transactionid,l.pid,l.mode,l.granted from pg_locks l left join pg_database d on l.database=d.oid left join pg_class r on l.relation=r.oid where relname like \u0026#39;%lzlpartition1%\u0026#39;; locktype | datname | relname | virtualxid | transactionid | pid | mode | granted ----------+---------+----------------------+------------+---------------+-------+---------------------+--------- relation | dbmgr | lzlpartition1_202301 | [null] | [null] | 21659 | AccessShareLock | t relation | dbmgr | lzlpartition1 | [null] | [null] | 95016 | AccessShareLock | t relation | dbmgr | lzlpartition1 | [null] | [null] | 95016 | AccessExclusiveLock | f -- Session adding primary key relation | dbmgr | lzlpartition1 | [null] | [null] | 21659 | AccessShareLock | t Adding a primary key requests an AccessExclusiveLock on the parent table, blocking everything. Adding an index on a partitioned table is very slow, and a primary key causes subsequent blocking. 
Currently, there is no low-impact way to add a primary key on a partitioned table. Workarounds: use the \u0026ldquo;ATTACH unique index + NOT NULL constraint\u0026rdquo; approach; schedule a long maintenance window for the business using the partitioned table and wait for index creation to complete; or use a third-party sync tool to copy the data into a partitioned table that already has the primary key.\nAdding Partitions to HASH Partitioned Tables # If the new number of partitions is an integer multiple of the old number, we can tell which old partition the data in each new partition came from. For example, when expanding a 3-partition HASH partitioned table to 6 partitions, rows from the old (modulus 3, remainder r) partition land in the new (modulus 6, remainder r) and (modulus 6, remainder r+3) partitions. Understanding this property is helpful, but in practice it is of limited use, because new HASH partitions are always populated by brute-force INSERT. Operationally, going from \u0026ldquo;3→4\u0026rdquo; partitions is no different from \u0026ldquo;3→6\u0026rdquo;. Mature data sync tools are now widely available. For example, you can use DTS to copy the table into a new table and then switch tables; this results in very short downtime and should be the preferred approach in production. The test below manually adds an integer multiple of partitions to a HASH partitioned table and observes the result:\nPartition info: SELECT tableoid::regclass,count(*) FROM orders group by tableoid::regclass; tableoid | count -----------+------- orders_p1 | 3377 orders_p3 | 3354 orders_p2 | 3369 2. 
DETACH partitions: Adding 3 more partitions to a 3-partition HASH native partitioned table: ALTER TABLE orders DETACH PARTITION orders_p1; ALTER TABLE orders DETACH PARTITION orders_p2; ALTER TABLE orders DETACH PARTITION orders_p3; RENAME partitions: ALTER TABLE orders_p1 RENAME TO bak_orders_p1; ALTER TABLE orders_p2 RENAME TO bak_orders_p2; ALTER TABLE orders_p3 RENAME TO bak_orders_p3; Create 6 HASH partitions on the old table: CREATE TABLE orders_p1 PARTITION OF orders FOR VALUES WITH (MODULUS 6, REMAINDER 0); CREATE TABLE orders_p2 PARTITION OF orders FOR VALUES WITH (MODULUS 6, REMAINDER 1); CREATE TABLE orders_p3 PARTITION OF orders FOR VALUES WITH (MODULUS 6, REMAINDER 2); CREATE TABLE orders_p4 PARTITION OF orders FOR VALUES WITH (MODULUS 6, REMAINDER 3); CREATE TABLE orders_p5 PARTITION OF orders FOR VALUES WITH (MODULUS 6, REMAINDER 4); CREATE TABLE orders_p6 PARTITION OF orders FOR VALUES WITH (MODULUS 6, REMAINDER 5); View partition info: Note the function used in the partition constraint: \\d+ orders_p1 Table \u0026#34;public.orders_p1\u0026#34; Column | Type | Collation | Nullable | Default | Storage | Stats target | Description ----------+-----------------------+-----------+----------+---------+----------+--------------+------------- order_id | integer | | | | plain | | name | character varying(10) | | | | extended | | Partition of: orders FOR VALUES WITH (modulus 6, remainder 0) Partition constraint: satisfies_hash_partition(\u0026#39;412053\u0026#39;::oid, 6, 0, order_id) Access method: heap Calculate which new partition old partition data should be inserted into. 
For example, the old modulus 3, remainder 0 partition\u0026rsquo;s data needs to be split into the modulus 6, remainder 0 and remainder 3 partitions:\nselect count(*) from bak_orders_p1 where satisfies_hash_partition(\u0026#39;412053\u0026#39;::oid, 6, 0, order_id)=true; count ------- 1776 select count(*) from bak_orders_p1 where satisfies_hash_partition(\u0026#39;412053\u0026#39;::oid, 6, 3, order_id)=true; count ------- 1601 select count(*) from bak_orders_p1; count ------- 3377 Insert data directly into partition child tables: You can insert data directly into the corresponding partition child tables rather than through the partition parent table: INSERT INTO orders_p1 SELECT * FROM bak_orders_p1 where satisfies_hash_partition(\u0026#39;412053\u0026#39;::oid, 6, 0, order_id)=true; INSERT INTO orders_p2 SELECT * FROM bak_orders_p2 where satisfies_hash_partition(\u0026#39;412053\u0026#39;::oid, 6, 1, order_id)=true; INSERT INTO orders_p3 SELECT * FROM bak_orders_p3 where satisfies_hash_partition(\u0026#39;412053\u0026#39;::oid, 6, 2, order_id)=true; INSERT INTO orders_p4 SELECT * FROM bak_orders_p1 where satisfies_hash_partition(\u0026#39;412053\u0026#39;::oid, 6, 3, order_id)=true; INSERT INTO orders_p5 SELECT * FROM bak_orders_p2 where satisfies_hash_partition(\u0026#39;412053\u0026#39;::oid, 6, 4, order_id)=true; INSERT INTO orders_p6 SELECT * FROM bak_orders_p3 where satisfies_hash_partition(\u0026#39;412053\u0026#39;::oid, 6, 5, order_id)=true; Verify data from 3 old partitions has been inserted into 6 new partitions: SELECT tableoid::regclass,count(*) FROM orders group by tableoid::regclass; tableoid | count -----------+------- orders_p3 | 1665 orders_p5 | 1678 orders_p1 | 1776 orders_p6 | 1689 orders_p4 | 1601 orders_p2 | 1691 Changing Column Length on Partitioned Tables Rebuilds Indexes # Modifying a column involves three considerations: table rewrite, index rebuild, and statistics loss.\nChanging column type or reducing column length rewrites the table. 
Increasing column length does not rewrite the table and only causes statistics loss (changing int4 to int8 is a type change, so it still rewrites the table). Increasing column length does not rebuild indexes either, with one exception: on a partitioned table, increasing the length of an indexed column rebuilds the index. For column modifications, refer to the PostgreSQL apprentice.\nHere we mainly test increasing column length on a partitioned table. If the column is indexed, the rebuild can block transactions on the partitioned table. Regular table, increasing the length of an indexed column:\n-- Create regular table and index =\u0026gt; create table t111(id int,name varchar(50)); CREATE TABLE =\u0026gt; insert into t111 values(1001,\u0026#39;abc\u0026#39;); INSERT 0 1 =\u0026gt; create index idx111 on t111(name); CREATE INDEX -- Index file relfilenode is 417728 =\u0026gt; select pg_relation_filepath(\u0026#39;idx111\u0026#39;); pg_relation_filepath ---------------------- base/16398/417728 (1 row) -- Increase column length =\u0026gt; alter table t111 alter column name type varchar(60); ALTER TABLE -- Index file relfilenode is still 417728, unchanged. Regular table index was NOT rebuilt. 
=\u0026gt; select pg_relation_filepath(\u0026#39;idx111\u0026#39;); pg_relation_filepath ---------------------- base/16398/417728 Partitioned table, increasing the length of an indexed column:\n-- Create an index on the partitioned table =\u0026gt; create index idx_name on lzlpartition1(name); CREATE INDEX -- Check the index on one partition =\u0026gt; \\d+ lzlpartition1_202301 Table \u0026#34;dbmgr.lzlpartition1_202301\u0026#34; Column | Type | Collation | Nullable | Default | Storage | Stats target | Description --------------+-----------------------------+-----------+----------+---------+----------+--------------+------------- id | integer | | | | plain | | name | character varying(50) | | | | extended | | date_created | timestamp without time zone | | not null | now() | plain | | Partition of: lzlpartition1 FOR VALUES FROM (\u0026#39;2023-01-01 00:00:00\u0026#39;) TO (\u0026#39;2023-02-01 00:00:00\u0026#39;) Partition constraint: ((date_created IS NOT NULL) AND (date_created \u0026gt;= \u0026#39;2023-01-01 00:00:00\u0026#39;::timestamp without time zone) AND (date_created \u0026lt; \u0026#39;2023-02-01 00:00:00\u0026#39;::timestamp without time zone)) Indexes: \u0026#34;lzlpartition1_202301_name_idx\u0026#34; btree (name) Access method: heap =\u0026gt; select pg_relation_filepath(\u0026#39;lzlpartition1_202301_name_idx\u0026#39;) idx,pg_relation_filepath(\u0026#39;lzlpartition1_202301\u0026#39;) tbl; idx | tbl -------------------+------------------- base/16398/417810 | base/16398/417800 (1 row) -- Increase the indexed column length — partitioned table index is rebuilt =\u0026gt; alter table lzlpartition1 alter column name type varchar(60); ALTER TABLE =\u0026gt; select pg_relation_filepath(\u0026#39;lzlpartition1_202301_name_idx\u0026#39;) idx,pg_relation_filepath(\u0026#39;lzlpartition1_202301\u0026#39;) tbl; idx | tbl -------------------+------------------- base/16398/417814 | base/16398/417800 -- Reduce the indexed column length — partitioned table is 
rewritten =\u0026gt; alter table lzlpartition1 alter column name type varchar(40); ALTER TABLE Time: 609.585 ms =\u0026gt; select pg_relation_filepath(\u0026#39;lzlpartition1_202301_name_idx\u0026#39;) idx,pg_relation_filepath(\u0026#39;lzlpartition1_202301\u0026#39;) tbl; idx | tbl -------------------+------------------- base/16398/417828 | base/16398/417825 -- Keep the indexed column length the same: partitioned table index is still rebuilt =\u0026gt; alter table lzlpartition1 alter column name type varchar(40); ALTER TABLE =\u0026gt; select pg_relation_filepath(\u0026#39;lzlpartition1_202301_name_idx\u0026#39;) idx,pg_relation_filepath(\u0026#39;lzlpartition1_202301\u0026#39;) tbl; idx | tbl -------------------+------------------- base/16398/417834 | base/16398/417825 For regular tables, increasing column length only requires attention to statistics loss (changing int to bigint is a type change and rewrites the table). For partitioned tables, however, increasing the length of an indexed column not only loses statistics but also rebuilds the index. Since ALTER TABLE ... ALTER COLUMN TYPE takes an AccessExclusiveLock (the heaviest, level-8 lock), the index rebuild prolongs the blocking. Recommendation: first drop the index, modify the column, then rebuild the index using the \u0026ldquo;parent table ONLY + child tables CIC + ATTACH\u0026rdquo; approach.\nPartition Table Maintenance Summary # PARTITION OF / DROP TABLE / DETACH require ACCESS EXCLUSIVE locks. ATTACH / DETACH CONCURRENTLY are recommended; they do not cause blocking. For DETACH CONCURRENTLY, watch for existing long-running transactions. Before ATTACH-ing a partition, you can pre-create a CHECK constraint on the table that matches the partition bounds. This eliminates the time spent scanning partition data during ATTACH. Currently, CIC (CREATE INDEX CONCURRENTLY) is not supported on partitioned tables. You can create partition indexes using the \u0026ldquo;ONLY on parent + CONCURRENTLY on children + ATTACH index\u0026rdquo; approach to reduce business blocking time. 
Partitioned tables do not support the USING INDEX method for creating primary keys. Pay attention to the exceptional case of modifying column length on partitioned tables. Partition Table Optimization # Partition Pruning # Partition Pruning can improve performance for declarative partitioning and is a very important feature for partitioned table optimization. Without partition pruning, queries would scan all partitions. With partition pruning, the optimizer can filter out partitions that don\u0026rsquo;t need to be accessed through the WHERE condition. Partition pruning relies on the PARTITION CONSTRAINT (visible with \\d+), which means queries must include partition key conditions for pruning to occur. This constraint differs from regular CHECK constraints — it is automatically created when the partition is created. Partition pruning is controlled by the enable_partition_pruning parameter, which defaults to on.\n-- Without partition pruning, all partitions are accessed =\u0026gt; set enable_partition_pruning=off; SET =\u0026gt; explain select count(*) from lzlpartition1 where date_created=\u0026#39;2023-01-01\u0026#39;; QUERY PLAN -------------------------------------------------------------------------------------------------- Aggregate (cost=1872.08..1872.09 rows=1 width=8) -\u0026gt; Append (cost=0.00..1872.07 rows=4 width=0) -\u0026gt; Seq Scan on lzlpartition1_202301 lzlpartition1_1 (cost=0.00..992.30 rows=1 width=0) Filter: (date_created = \u0026#39;2023-01-01 00:00:00\u0026#39;::timestamp without time zone) -\u0026gt; Seq Scan on lzlpartition1_202302 lzlpartition1_2 (cost=0.00..864.12 rows=1 width=0) Filter: (date_created = \u0026#39;2023-01-01 00:00:00\u0026#39;::timestamp without time zone) -\u0026gt; Seq Scan on lzlpartition1_202304 lzlpartition1_3 (cost=0.00..15.62 rows=2 width=0) Filter: (date_created = \u0026#39;2023-01-01 00:00:00\u0026#39;::timestamp without time zone) -- With partition pruning enabled, partitions that don\u0026#39;t need to be 
accessed are excluded =\u0026gt; set enable_partition_pruning=on; SET =\u0026gt; explain select count(*) from lzlpartition1 where date_created=\u0026#39;2023-01-01\u0026#39;; QUERY PLAN ------------------------------------------------------------------------------------------ Aggregate (cost=992.30..992.31 rows=1 width=8) -\u0026gt; Seq Scan on lzlpartition1_202301 lzlpartition1 (cost=0.00..992.30 rows=1 width=0) Filter: (date_created = \u0026#39;2023-01-01 00:00:00\u0026#39;::timestamp without time zone) (3 rows) (Partitions pruned during plan generation simply do not appear in the plan, as in the EXPLAIN example above. According to the official documentation, \u0026ldquo;Subplans Removed\u0026rdquo; is reported only for partitions pruned during executor initialization, which is why it does not show up here.) Partition pruning can occur at two stages: during execution plan generation, and during actual execution. Execution-time pruning exists because sometimes the partitions that can be pruned are known only at execution time. There are two scenarios:\nParameterized Nested Loop Joins: The parameter from the outer side of the join can be used to determine the minimum set of inner side partitions to scan.\nInitplans: Once an initplan has been executed we can then determine which partitions match the value from the initplan.\nSimulating runtime pruning: When the value comes from another table, the optimizer cannot know it at plan time, so it cannot use it as a basis for partition pruning during plan generation:\n-- Create another table =\u0026gt; create table x(date_created timestamp); CREATE TABLE =\u0026gt; insert into x values(\u0026#39;2023-01-01 09:00:00\u0026#39;); INSERT 0 1 -- Generate execution plan only, don\u0026#39;t execute: no pruning occurred =\u0026gt; explain select count(*) from lzlpartition1 where date_created=(select date_created from x); QUERY PLAN -------------------------------------------------------------------------------------------------- Aggregate (cost=1904.68..1904.69 rows=1 width=8) 
InitPlan 1 (returns $0) -\u0026gt; Seq Scan on x (cost=0.00..32.60 rows=2260 width=8) -\u0026gt; Append (cost=0.00..1872.07 rows=4 width=0) -\u0026gt; Seq Scan on lzlpartition1_202301 lzlpartition1_1 (cost=0.00..992.30 rows=1 width=0) Filter: (date_created = $0) -\u0026gt; Seq Scan on lzlpartition1_202302 lzlpartition1_2 (cost=0.00..864.12 rows=1 width=0) Filter: (date_created = $0) -\u0026gt; Seq Scan on lzlpartition1_202304 lzlpartition1_3 (cost=0.00..15.62 rows=2 width=0) Filter: (date_created = $0) (10 rows) -- Execute the SQL — pruning occurred. Notice the \u0026#34;never executed\u0026#34; keyword. =\u0026gt; explain analyze select count(*) from lzlpartition1 where date_created=(select date_created from x); QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=1904.68..1904.69 rows=1 width=8) (actual time=5.680..5.682 rows=1 loops=1) InitPlan 1 (returns $0) -\u0026gt; Seq Scan on x (cost=0.00..32.60 rows=2260 width=8) (actual time=0.013..0.014 rows=1 loops=1) -\u0026gt; Append (cost=0.00..1872.07 rows=4 width=0) (actual time=0.029..5.676 rows=2 loops=1) -\u0026gt; Seq Scan on lzlpartition1_202301 lzlpartition1_1 (cost=0.00..992.30 rows=1 width=0) (actual time=0.008..5.652 rows=2 loops=1) Filter: (date_created = $0) Rows Removed by Filter: 45382 -\u0026gt; Seq Scan on lzlpartition1_202302 lzlpartition1_2 (cost=0.00..864.12 rows=1 width=0) (never executed) Filter: (date_created = $0) -\u0026gt; Seq Scan on lzlpartition1_202304 lzlpartition1_3 (cost=0.00..15.62 rows=2 width=0) (never executed) Filter: (date_created = $0) Planning Time: 0.157 ms Execution Time: 5.732 ms (13 rows) Partition Wise Join # Partition wise join can reduce the cost of partition joins. Suppose there are two partitioned tables t1 and t2, both with 3 partitions (p1, p2, p3) with identical partition definitions. 
t1 has 10 rows per partition, t2 has 20 rows per partition:\nt1 t2 p1 10 rows 20 rows p2 10 rows 20 rows p3 10 rows 20 rows When t1 and t2 are joined normally, all rows from both partitioned tables have to be extracted for the join. The number of row comparison operations would be: (10+10+10)*(20+20+20)=1800 With partition wise join, since the partition structures are identical, only corresponding partitions need to be joined, e.g.: t1.p1\u0026lt;=\u0026gt;t2.p1, t1.p2\u0026lt;=\u0026gt;t2.p2, t1.p3\u0026lt;=\u0026gt;t2.p3. The number of row comparison operations becomes: (10*20)*3=600 When there are many partitions, the cost savings of partition wise join are significant. Parameter enable_partitionwise_join: whether to enable partition wise join, default is off.\nThe prerequisites for partition wise join are very strict:\nThe join condition must include the partition key. The partition keys must be of the same data type. Partitions must correspond one-to-one. Because these conditions are strict, it\u0026rsquo;s relatively rare for tables with different purposes to qualify for partition wise join. A common case would be both tables using RANGE time partitioning. 
Another scenario: a partitioned table self-joining also meets partition wise join prerequisites:\n-- Without partition wise join enabled =\u0026gt; explain select p1.*,p2.name from lzlpartition1 p1,lzlpartition1 p2 where p1.date_created=p2.date_created and p2.name=\u0026#39;256ac66bb53d31bc6124294238d6410c\u0026#39;; QUERY PLAN ---------------------------------------------------------------------------------------------------------------- Hash Join (cost=546.64..9256.34 rows=182252 width=288) Hash Cond: (p1.date_created = p2.date_created) -\u0026gt; Append (cost=0.00..2085.46 rows=85364 width=150) -\u0026gt; Seq Scan on lzlpartition1_202301 p1_1 (cost=0.00..878.84 rows=45384 width=150) -\u0026gt; Seq Scan on lzlpartition1_202302 p1_2 (cost=0.00..765.30 rows=39530 width=150) -\u0026gt; Seq Scan on lzlpartition1_202304 p1_3 (cost=0.00..14.50 rows=450 width=150) -\u0026gt; Hash (cost=541.30..541.30 rows=427 width=146) -\u0026gt; Append (cost=7.17..541.30 rows=427 width=146) -\u0026gt; Bitmap Heap Scan on lzlpartition1_202301 p2_1 (cost=7.17..284.30 rows=227 width=146) Recheck Cond: ((name)::text = \u0026#39;256ac66bb53d31bc6124294238d6410c\u0026#39;::text) -\u0026gt; Bitmap Index Scan on lzlpartition1_202301_name_idx (cost=0.00..7.12 rows=227 width=0) Index Cond: ((name)::text = \u0026#39;256ac66bb53d31bc6124294238d6410c\u0026#39;::text) -\u0026gt; Bitmap Heap Scan on lzlpartition1_202302 p2_2 (cost=6.95..248.52 rows=198 width=146) Recheck Cond: ((name)::text = \u0026#39;256ac66bb53d31bc6124294238d6410c\u0026#39;::text) -\u0026gt; Bitmap Index Scan on lzlpartition1_202302_name_idx (cost=0.00..6.90 rows=198 width=0) Index Cond: ((name)::text = \u0026#39;256ac66bb53d31bc6124294238d6410c\u0026#39;::text) -\u0026gt; Bitmap Heap Scan on lzlpartition1_202304 p2_3 (cost=2.66..6.35 rows=2 width=146) Recheck Cond: ((name)::text = \u0026#39;256ac66bb53d31bc6124294238d6410c\u0026#39;::text) -\u0026gt; Bitmap Index Scan on lzlpartition1_202304_name_idx (cost=0.00..2.66 rows=2 
width=0) Index Cond: ((name)::text = \u0026#39;256ac66bb53d31bc6124294238d6410c\u0026#39;::text) (20 rows) -- With partition wise join enabled =\u0026gt; set enable_partitionwise_join =on; SET =\u0026gt; explain select p1.*,p2.name from lzlpartition1 p1,lzlpartition1 p2 where p1.date_created=p2.date_created and p2.name=\u0026#39;256ac66bb53d31bc6124294238d6410c\u0026#39;; QUERY PLAN ---------------------------------------------------------------------------------------------------------------- Append (cost=287.14..2529.83 rows=438 width=288) -\u0026gt; Hash Join (cost=287.14..1338.49 rows=232 width=288) Hash Cond: (p1_1.date_created = p2_1.date_created) -\u0026gt; Seq Scan on lzlpartition1_202301 p1_1 (cost=0.00..878.84 rows=45384 width=150) -\u0026gt; Hash (cost=284.30..284.30 rows=227 width=146) -\u0026gt; Bitmap Heap Scan on lzlpartition1_202301 p2_1 (cost=7.17..284.30 rows=227 width=146) Recheck Cond: ((name)::text = \u0026#39;256ac66bb53d31bc6124294238d6410c\u0026#39;::text) -\u0026gt; Bitmap Index Scan on lzlpartition1_202301_name_idx (cost=0.00..7.12 rows=227 width=0) Index Cond: ((name)::text = \u0026#39;256ac66bb53d31bc6124294238d6410c\u0026#39;::text) -\u0026gt; Hash Join (cost=250.99..1166.55 rows=202 width=288) Hash Cond: (p1_2.date_created = p2_2.date_created) -\u0026gt; Seq Scan on lzlpartition1_202302 p1_2 (cost=0.00..765.30 rows=39530 width=150) -\u0026gt; Hash (cost=248.52..248.52 rows=198 width=146) -\u0026gt; Bitmap Heap Scan on lzlpartition1_202302 p2_2 (cost=6.95..248.52 rows=198 width=146) Recheck Cond: ((name)::text = \u0026#39;256ac66bb53d31bc6124294238d6410c\u0026#39;::text) -\u0026gt; Bitmap Index Scan on lzlpartition1_202302_name_idx (cost=0.00..6.90 rows=198 width=0) Index Cond: ((name)::text = \u0026#39;256ac66bb53d31bc6124294238d6410c\u0026#39;::text) -\u0026gt; Hash Join (cost=6.37..22.60 rows=4 width=288) Hash Cond: (p1_3.date_created = p2_3.date_created) -\u0026gt; Seq Scan on lzlpartition1_202304 p1_3 (cost=0.00..14.50 rows=450 
width=150) -\u0026gt; Hash (cost=6.35..6.35 rows=2 width=146) -\u0026gt; Bitmap Heap Scan on lzlpartition1_202304 p2_3 (cost=2.66..6.35 rows=2 width=146) Recheck Cond: ((name)::text = \u0026#39;256ac66bb53d31bc6124294238d6410c\u0026#39;::text) -\u0026gt; Bitmap Index Scan on lzlpartition1_202304_name_idx (cost=0.00..2.66 rows=2 width=0) Index Cond: ((name)::text = \u0026#39;256ac66bb53d31bc6124294238d6410c\u0026#39;::text) (25 rows) Without partition wise join enabled, the optimizer first accesses all partition data from p2 (matching the filter) and combines them (Append), then Hash Joins with all partition data from p1 through the partition key. With partition wise join enabled, the optimizer joins corresponding partitions from p1 and p2 (actually the same table accessed twice): p1_1\u0026lt;=\u0026gt;p2_1 Hash Join p1_2\u0026lt;=\u0026gt;p2_2 Hash Join p1_3\u0026lt;=\u0026gt;p2_3 Hash Join Then combines the data together (Append). If there are enough data partitions, combined with partition pruning, partition wise join can have very good optimization effects.\nPartition Wise Grouping/Aggregation # When performing aggregation on partitioned data, partitions can each compute independently — there is no need to scan all partition data for aggregation. Each partition computes its own aggregation, then the results are collected and returned. Without partition wise grouping, it\u0026rsquo;s essentially \u0026ldquo;scan all partitions first, then aggregate.\u0026rdquo; With partition wise grouping, it\u0026rsquo;s \u0026ldquo;aggregate per partition first, then combine results.\u0026rdquo;\nAdvantages of partition wise grouping:\nWhen partitions are on foreign servers, the aggregation operator can be pushed down to the foreign server. When aggregating into hash tables, each partition rather than the entire table uses the memory hash table space, reducing memory usage. 
Aggregation algorithms pushed down to individual partitions can better utilize features like indexes and parallelism. Fewer data comparisons. Although data scanning is the same, there are fewer data comparisons — for example, data from the last partition does not need to be compared with data from the first partition. Parameter enable_partitionwise_aggregate: whether to enable partition wise grouping/aggregation, default is off.\nPartition wise aggregate example:\n=\u0026gt; vacuum (analyze) lzlpartition1; -- Without wise agg =\u0026gt; set enable_partitionwise_aggregate =off; SET =\u0026gt; explain select date_created,min(id),count(*) from lzlpartition1 group by date_created order by 1,2,3; QUERY PLAN ------------------------------------------------------------------------------------------------------------- Sort (cost=10354.94..10562.89 rows=83180 width=20) Sort Key: lzlpartition1.date_created, (min(lzlpartition1.id)), (count(*)) -\u0026gt; HashAggregate (cost=2725.69..3557.49 rows=83180 width=20) Group Key: lzlpartition1.date_created -\u0026gt; Append (cost=0.00..2085.46 rows=85364 width=12) -\u0026gt; Seq Scan on lzlpartition1_202301 lzlpartition1_1 (cost=0.00..878.84 rows=45384 width=12) -\u0026gt; Seq Scan on lzlpartition1_202302 lzlpartition1_2 (cost=0.00..765.30 rows=39530 width=12) -\u0026gt; Seq Scan on lzlpartition1_202304 lzlpartition1_3 (cost=0.00..14.50 rows=450 width=12) -- With wise agg enabled =\u0026gt; set enable_partitionwise_aggregate =on; SET =\u0026gt; explain select date_created,min(id),count(*) from lzlpartition1 group by date_created order by 1,2,3; QUERY PLAN ------------------------------------------------------------------------------------------------------------- Sort (cost=10356.08..10564.32 rows=83296 width=20) Sort Key: lzlpartition1.date_created, (min(lzlpartition1.id)), (count(*)) -\u0026gt; Append (cost=1219.22..3548.31 rows=83296 width=20) -\u0026gt; HashAggregate (cost=1219.22..1663.09 rows=44387 width=20) Group Key: 
lzlpartition1.date_created -\u0026gt; Seq Scan on lzlpartition1_202301 lzlpartition1 (cost=0.00..878.84 rows=45384 width=12) -\u0026gt; HashAggregate (cost=1061.77..1448.86 rows=38709 width=20) Group Key: lzlpartition1_1.date_created -\u0026gt; Seq Scan on lzlpartition1_202302 lzlpartition1_1 (cost=0.00..765.30 rows=39530 width=12) -\u0026gt; HashAggregate (cost=17.88..19.88 rows=200 width=20) Group Key: lzlpartition1_2.date_created -\u0026gt; Seq Scan on lzlpartition1_202304 lzlpartition1_2 (cost=0.00..14.50 rows=450 width=12) (12 rows) Without partition wise aggregate: first scan all data then combine (Append), then aggregate (HashAggregate). With partition wise aggregate: first aggregate on each partition (HashAggregate), then combine results (Append).\nPartial Aggregation The aggregation algorithm can be pushed down to partitions for computation. At this point, the aggregated results fall into two categories: non-duplicate aggregation data (GROUP BY includes the partition key), and duplicate aggregation data (GROUP BY does not include the partition key). When aggregation data is non-duplicate, simply appending the per-partition computed aggregation data is sufficient (as in the example above). When per-partition aggregation data has duplicates, an additional aggregation step (Finalize Aggregate) is needed. 
Aggregation that does not include the partition key is partial aggregation.\nPartial aggregation example:\n-- When GROUP BY is not the partition key =\u0026gt; show enable_partitionwise_aggregate; enable_partitionwise_aggregate -------------------------------- on =\u0026gt; explain select id,count(*) from lzlpartition1 group by id ; QUERY PLAN ------------------------------------------------------------------------------------------------------------ Finalize HashAggregate (cost=2474.80..2573.80 rows=9900 width=12) Group Key: lzlpartition1.id -\u0026gt; Append (cost=1105.76..2377.47 rows=19467 width=12) -\u0026gt; Partial HashAggregate (cost=1105.76..1202.28 rows=9652 width=12) Group Key: lzlpartition1.id -\u0026gt; Seq Scan on lzlpartition1_202301 lzlpartition1 (cost=0.00..878.84 rows=45384 width=4) -\u0026gt; Partial HashAggregate (cost=962.95..1059.10 rows=9615 width=12) Group Key: lzlpartition1_1.id -\u0026gt; Seq Scan on lzlpartition1_202302 lzlpartition1_1 (cost=0.00..765.30 rows=39530 width=4) -\u0026gt; Partial HashAggregate (cost=16.75..18.75 rows=200 width=12) Group Key: lzlpartition1_2.id -\u0026gt; Seq Scan on lzlpartition1_202304 lzlpartition1_2 (cost=0.00..14.50 rows=450 width=4) When GROUP BY does not include the partition key, aggregation can still be performed, but a subsequent Finalize HashAggregate is required.\nEven without GROUP BY, Partial Aggregate can still occur:\n=\u0026gt; show enable_partitionwise_aggregate; enable_partitionwise_aggregate -------------------------------- on =\u0026gt; explain select count(*) from lzlpartition1; QUERY PLAN ------------------------------------------------------------------------------------------------------------ Finalize Aggregate (cost=1872.10..1872.11 rows=1 width=8) -\u0026gt; Append (cost=992.30..1872.10 rows=3 width=8) -\u0026gt; Partial Aggregate (cost=992.30..992.31 rows=1 width=8) -\u0026gt; Seq Scan on lzlpartition1_202301 lzlpartition1 (cost=0.00..878.84 rows=45384 width=0) -\u0026gt; Partial 
Aggregate (cost=864.12..864.13 rows=1 width=8) -\u0026gt; Seq Scan on lzlpartition1_202302 lzlpartition1_1 (cost=0.00..765.30 rows=39530 width=0) -\u0026gt; Partial Aggregate (cost=15.62..15.63 rows=1 width=8) -\u0026gt; Seq Scan on lzlpartition1_202304 lzlpartition1_2 (cost=0.00..14.50 rows=450 width=0) =\u0026gt; explain select max(date_created) from lzlpartition1; QUERY PLAN ------------------------------------------------------------------------------------------------------------ Finalize Aggregate (cost=1872.10..1872.11 rows=1 width=8) -\u0026gt; Append (cost=992.30..1872.10 rows=3 width=8) -\u0026gt; Partial Aggregate (cost=992.30..992.31 rows=1 width=8) -\u0026gt; Seq Scan on lzlpartition1_202301 lzlpartition1 (cost=0.00..878.84 rows=45384 width=8) -\u0026gt; Partial Aggregate (cost=864.12..864.13 rows=1 width=8) -\u0026gt; Seq Scan on lzlpartition1_202302 lzlpartition1_1 (cost=0.00..765.30 rows=39530 width=8) -\u0026gt; Partial Aggregate (cost=15.62..15.63 rows=1 width=8) -\u0026gt; Seq Scan on lzlpartition1_202304 lzlpartition1_2 (cost=0.00..14.50 rows=450 width=8) The precondition for triggering Partial Aggregate is not GROUP BY. We should think from the purpose of Partial Aggregate — it aims to push aggregation down to partitions. Aggregation without GROUP BY can also be done this way, as shown in the two examples above: they both compute aggregation on each partition first (Partial Aggregate), then combine and aggregate once more (Finalize Aggregate). Without the parameter enabled, these aggregations would occur after scanning all partitions.\nHistory of Partitioned Tables # Declarative partitioning has gone through many version enhancements and is now very mature. Here\u0026rsquo;s a summary of declarative partitioning feature enhancements across PostgreSQL versions:\nPre-PG9.6\nOnly inheritance tables could implement partitioning functionality. PG10\nDeclarative partitioning supported. RANGE and LIST partitioning supported. 
ATTACH/DETACH table partitions supported. Partition pruning supported. PG11\nAdded HASH partition support. Support for creating primary keys, foreign keys, indexes, and triggers. Support for updating partition key; automatic creation of indexes on partitions. Support for DEFAULT partition. Support for ATTACH index. Support for FOR EACH ROW triggers, automatically created on existing and future child partitions. New enable_partition_pruning parameter; pruning enhancements. Support for partition wise join. Support for partition wise aggregation. PG12\nEnhanced query, insert, pruning, and COPY performance. Support for foreign key constraints referencing partitioned tables. Support for non-blocking partition ATTACH: ALTER TABLE ATTACH PARTITION. PG13\nEnhanced pruning. Enhanced partition wise join. Support for BEFORE triggers. Support for publishing partitioned tables; support for subscribing and writing to partitioned tables. PG14\nEnhanced UPDATE and DELETE performance. Support for non-blocking partition DETACH: ALTER TABLE ... DETACH PARTITION ... CONCURRENTLY. Support for REINDEX on partitioned table indexes. PG15\nEnhanced execution plan generation, reducing generation time with many partitions. Enhanced sorting. Support for CLUSTER on partitioned tables. PG16\nEnhanced GENERATED column restrictions: if the parent table has a generated column, child partitions must also include it. Enhanced lookup for RANGE and LIST partitions. 
References # 《PostgreSQL修炼之道》\nhttps://mp.weixin.qq.com/s/NW8XOZNq0YlDZvx24H737Q https://www.postgresql.org/docs/current/ddl-partitioning.html https://www.postgresql.org/docs/current/ddl-inherit.html https://www.postgresql.org/docs/13/sql-altertable.html https://github.com/postgrespro/pg_pathman https://developer.aliyun.com/article/62314 https://hevodata.com/learn/postgresql-partitions https://www.postgresql.fastware.com/postgresql-insider-prt-ove https://www.buckenhofer.com/2021/01/postgresql-partitioning-guide/ https://www.depesz.com/2018/05/01/waiting-for-postgresql-11-support-partition-pruning-at-execution-time/ https://blog.csdn.net/horses/article/details/86164273\nhttp://www.pgsql.tech/article_0_10000102\nhttps://brandur.org/fragments/postgres-partitioning-2022\n","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/2024/08/12/postgresql-table-partitioning-deep-dive/","section":"Posts","summary":"What is a Partitioned Table # Database partitioning splits table data into smaller physical shards to improve performance, availability, and manageability. Partitioned tables are a common optimization technique for large tables in relational databases. 
A DBMS generally provides partition management, and applications can access partitioned tables directly without changing their architecture, though good performance requires proper partition access patterns.\nPartitioned tables are a common database technology, but PostgreSQL partitioned tables have many unique characteristics: multiple implementation approaches, partitions being regular tables, partition maintenance strategies, SQL optimization considerations, and some known issues.\n","title":"PostgreSQL Table Partitioning Deep Dive","type":"posts"},{"content":"","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/categories/postgresql%E9%9D%A2%E8%AF%95%E9%A2%98/","section":"Categories","summary":"","title":"PostgreSQL面试题","type":"categories"},{"content":"I\u0026rsquo;ve already written a fairly detailed article about logical replication before, so I won\u0026rsquo;t repeat the basics here. However, some knowledge points inevitably get missed. Recently I\u0026rsquo;ve discovered some interesting logical replication features.\nreplica identity and old/new values # replica identity is used to identify a row during logical replication. The above statement is certainly correct, but it doesn\u0026rsquo;t explain the changes in old and new data.\nDEFAULT Records the old values of the columns of the primary key, if any. This is the default for non-system tables. USING INDEX index_name Records the old values of the columns covered by the named index, that must be unique, not partial, not deferrable, and include only columns marked NOT NULL. If this index is dropped, the behavior is the same as NOTHING. FULL Records the old values of all columns in the row. NOTHING Records no information about the old row. This is the default for system tables.\nThe official PostgreSQL documentation describes replica identity only in terms of old values; for example, it doesn\u0026rsquo;t even mention that NOTHING won\u0026rsquo;t replicate update/delete. 
This shows the importance of old values.\nCreating a replication link:\nselect pg_create_logical_replication_slot(\u0026#39;pubtestlzl2\u0026#39;,\u0026#39;test_decoding\u0026#39;); pg_recvlogical -d lzldb --slot=pubtestlzl2 --start -f recv.sql \u0026amp; Normal test_decoding replication link simulation:\n--replica identity defaults to d: uses primary key when available; without primary key, defaults to nothing, unable to replicate update and delete M=\u0026gt; create table lzltest(a bigint primary key,b varchar(100),c varchar(100)); CREATE TABLE M=\u0026gt; insert into lzltest values(1,\u0026#39;bbbbbb\u0026#39;,\u0026#39;ccccccccc\u0026#39;); INSERT 0 1 M=\u0026gt; update lzltest set b=\u0026#39;b\u0026#39;; UPDATE 1 recvlogical output:\ntable public.lzltest: INSERT: a[bigint]:1 b[character varying]:\u0026#39;bbbbbb\u0026#39; c[character varying]:\u0026#39;ccccccccc\u0026#39; table public.lzltest: UPDATE: a[bigint]:1 b[character varying]:\u0026#39;b\u0026#39; c[character varying]:\u0026#39;ccccccccc\u0026#39; With replica identity as default, updating a non-primary-key field — all fields have only new values.\nM=\u0026gt; update lzltest set a=\u0026#39;111\u0026#39;; UPDATE 1 table public.lzltest: UPDATE: old-key: a[bigint]:1 new-tuple: a[bigint]:111 b[character varying]:\u0026#39;bb\u0026#39; c[character varying]:\u0026#39;ccccccccc\u0026#39; With replica identity as default, updating the primary key — the identity column\u0026rsquo;s old and new values are decoded; other fields only have new values.\nM=\u0026gt; alter table lzltest replica identity full; ALTER TABLE M=\u0026gt; update lzltest set b=\u0026#39;b\u0026#39;; UPDATE 1 table public.lzltest: UPDATE: old-key: a[bigint]:2 b[character varying]:\u0026#39;b\u0026#39; c[character varying]:\u0026#39;ccccccccc\u0026#39; new-tuple: a[bigint]:2 b[character varying]:\u0026#39;b\u0026#39; c[character varying]:\u0026#39;ccccccccc\u0026#39; With replica identity set to full, both old and new values for the entire 
row are preserved.\nWhether in default (primary key) or full mode, all column information is recorded. The difference lies in whether old data is present. In default mode:\ninsert: inherently new data, so naturally no old values — all column new values are recorded. update: records new values for all columns; only the identity column has old values (if the identity column was updated). delete: inherently old data, but not all columns are necessarily recorded. The same rule applies: only the identity column has old values — only the identity column is recorded. Summary: When replica identity is default, regardless of the operation (INSERT, UPDATE, DELETE), as long as it\u0026rsquo;s old data, only the identity column is recorded; as long as it\u0026rsquo;s new data, all columns are recorded.\nWhen changing from default to full, the decoded log volume difference isn\u0026rsquo;t particularly large, because new data always includes all columns. (Excluding scenarios that are entirely deletes) the log volume decoded under full is less than twice that of default.\npgoutput cannot be peeked # Create a replication slot using pgoutput:\nselect pg_create_logical_replication_slot(\u0026#39;pubtestlzl\u0026#39;,\u0026#39;pgoutput\u0026#39;); Then try to peek or receive — both fail:\nselect * from pg_logical_slot_peek_changes(\u0026#39;pubtestlzl\u0026#39;,null,null); pg_recvlogical -d lzldb --slot=pubtestlzl --start -f recv.sql \u0026amp; pg_recvlogical: error: could not send replication command \u0026#34;START_REPLICATION SLOT \u0026#34;pubtestlzl\u0026#34; LOGICAL 0/0\u0026#34;: ERROR: client sent proto_version=0 but we only support protocol 1 or higher CONTEXT: slot \u0026#34;pubtestlzl\u0026#34;, output plugin \u0026#34;pgoutput\u0026#34;, in the startup callback pg_recvlogical: disconnected; waiting 5 seconds to try again You cannot peek or use pg_recvlogical to receive from a pgoutput replication slot. 
Since pgoutput is the output plugin for publish-subscribe, this plugin cannot be manually peeked or received\u0026hellip;\nPublish-Subscribe Doesn\u0026rsquo;t Have to Be PG-to-PG # create publication and create subscription are PG internal commands that can also be used to create links between PG databases. Third-party software can similarly use create publication and simulate subscriptions to create replication slots. This is better than directly creating replication slots because publications can manage replicated tables.\nTOAST and Logical Decoding # TOAST columns being sent are NOT decoded! This means an entire row of data may only have part of it transmitted (when TOAST columns themselves haven\u0026rsquo;t been updated).\nNormal decoding decodes all columns:\n--Create a test_decoding replication slot =\u0026gt; select pg_create_logical_replication_slot(\u0026#39;logical_dest\u0026#39;,\u0026#39;test_decoding\u0026#39;); pg_create_logical_replication_slot ------------------------------------ (logical_dest,349/A80040E0) (1 row) --Create a table with small columns =\u0026gt; create table test1(a int primary key,b varchar(100),c varchar(100)); CREATE TABLE =\u0026gt; select* from pg_replication_slots; slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size --------------+---------------+-----------+--------+----------+-----------+--------+------------+--------+--------------+--------------+---------------------+------------+--------------- logical_dest | test_decoding | logical | 418679 | lzldb | f | f | [null] | [null] | 872483335 | 349/A80040A8 | 349/A80040E0 | reserved | [null] (1 row) =\u0026gt; insert into test1 values (1,\u0026#39;qwer\u0026#39;,\u0026#39;qwer\u0026#39;); INSERT 0 1 Time: 0.915 ms =\u0026gt; select * from pg_logical_slot_peek_changes(\u0026#39;logical_dest\u0026#39;,null,null); lsn | xid | data 
--------------+-----------+-------------------------------------------------------------------------------------------------- 349/A8004C78 | 872483335 | BEGIN 872483335 349/A80103E8 | 872483335 | COMMIT 872483335 349/A8018B30 | 872483369 | BEGIN 872483369 349/A8018B30 | 872483369 | table public.test1: INSERT: a[integer]:1 b[character varying]:\u0026#39;qwer\u0026#39; c[character varying]:\u0026#39;qwer\u0026#39; 349/A8018C50 | 872483369 | COMMIT 872483369 (5 rows) --insert is decoded, containing all columns =\u0026gt; update test1 set b=\u0026#39;zxcv\u0026#39; where c=\u0026#39;qwer\u0026#39;; UPDATE 1 Time: 4.005 ms =\u0026gt; select * from pg_logical_slot_peek_changes(\u0026#39;logical_dest\u0026#39;,null,null); lsn | xid | data --------------+-----------+-------------------------------------------------------------------------------------------------- 349/A8004C78 | 872483335 | BEGIN 872483335 349/A80103E8 | 872483335 | COMMIT 872483335 349/A8018B30 | 872483369 | BEGIN 872483369 349/A8018B30 | 872483369 | table public.test1: INSERT: a[integer]:1 b[character varying]:\u0026#39;qwer\u0026#39; c[character varying]:\u0026#39;qwer\u0026#39; 349/A8018C50 | 872483369 | COMMIT 872483369 349/A801D018 | 872483378 | BEGIN 872483378 349/A801D018 | 872483378 | table public.test1: UPDATE: a[integer]:1 b[character varying]:\u0026#39;zxcv\u0026#39; c[character varying]:\u0026#39;qwer\u0026#39; 349/A801D098 | 872483378 | COMMIT 872483378 (8 rows) --update is decoded, containing all columns Normally, without TOAST, decoded data includes all columns of the row.\nTOAST decoding test:\n--Enlarge the columns =\u0026gt; alter table test1 alter column b type varchar(3000); ALTER TABLE Time: 8.091 ms =\u0026gt; alter table test1 alter column c type varchar(3000); ALTER TABLE Time: 0.937 ms --A batch random function =\u0026gt; create or replace function f_random_str(length INTEGER) returns character varying -\u0026gt; LANGUAGE plpgsql -\u0026gt; AS $$ -\u0026gt; DECLARE -\u0026gt; 
result varchar(3000); -\u0026gt; BEGIN -\u0026gt; SELECT array_to_string(ARRAY(SELECT chr((65 + round(random() * 25)) :: integer) -\u0026gt; FROM generate_series(1,length)), \u0026#39;\u0026#39;) INTO result; -\u0026gt; return result; -\u0026gt; END -\u0026gt; $$; CREATE FUNCTION --Insert data =\u0026gt; insert into test1 values (2,f_random_str(2000),f_random_str(2000)); INSERT 0 1 --Check for TOAST =\u0026gt; SELECT -\u0026gt; n.nspname as schema, -\u0026gt; s.oid::regclass as relname, -\u0026gt; s.reltoastrelid::regclass as toast_name, -\u0026gt; pg_relation_size(s.reltoastrelid) AS toast_size -\u0026gt; FROM -\u0026gt; pg_class s join pg_namespace n -\u0026gt; on s.relnamespace = n.oid -\u0026gt; WHERE -\u0026gt; relkind = \u0026#39;r\u0026#39; -\u0026gt; AND reltoastrelid \u0026lt;\u0026gt; 0 -\u0026gt; AND n.nspname = \u0026#39;public\u0026#39; -\u0026gt; ORDER BY -\u0026gt; 3 DESC; schema | relname | toast_name | toast_size --------+---------+--------------------------+------------ public | test1 | pg_toast.pg_toast_418714 | 8192 (1 row) --Update via primary key, updating a TOAST column =\u0026gt; update test1 set b=\u0026#39;zxcv\u0026#39; where a=2; UPDATE 1 =\u0026gt; select * from pg_logical_slot_peek_changes(\u0026#39;logical_dest\u0026#39;,null,null); lsn | xid | --------------+-----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ... 
349/A851FD90 | 872483420 | BEGIN 872483420 349/A85216E0 | 872483420 | table public.test1: INSERT: a[integer]:2 b[character varying]:\u0026#39;GIORCXQQWDBGTUNDZXAWMPYOUEGTECWTVQGDQGSPMEPJNPUQIFMESLRASBZWGONETRENDCHLDWVTDWJLTGRYUMFDOWHLEYLUTECPOVCYXFIATLKVEQTHSC\u0026#39; 349/A85218A0 | 872483420 | COMMIT 872483420 349/A8525CA8 | 872483429 | BEGIN 872483429 349/A8525D50 | 872483429 | table public.test1: UPDATE: a[integer]:2 b[character varying]:\u0026#39;zxcv\u0026#39; c[character varying]:unchanged-toast-datum 349/A8525DE0 | 872483429 | COMMIT 872483429 Column c, which has TOAST and was not involved in the update, has no decoded data — it directly outputs toast datum unchanged: unchanged-toast-datum.\nTesting with wal2json:\n=\u0026gt; select pg_create_logical_replication_slot(\u0026#39;logical_json\u0026#39;,\u0026#39;wal2json\u0026#39;); pg_create_logical_replication_slot ------------------------------------ (logical_json,349/A87CAB58) (1 row) =\u0026gt; update test1 set b=\u0026#39;zxcv\u0026#39; where a=2; UPDATE 1 =\u0026gt; \\pset format wrapped Output format is wrapped. =\u0026gt; \\pset columns 200 Target width is 200. 
=\u0026gt; select * from pg_logical_slot_peek_changes(\u0026#39;logical_json\u0026#39;,null,null); -[ RECORD 1 ]------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ lsn | 349/A87CACF8 xid | 872483495 data | {\u0026#34;change\u0026#34;:[{\u0026#34;kind\u0026#34;:\u0026#34;update\u0026#34;,\u0026#34;schema\u0026#34;:\u0026#34;public\u0026#34;,\u0026#34;table\u0026#34;:\u0026#34;test1\u0026#34;,\u0026#34;columnnames\u0026#34;:[\u0026#34;a\u0026#34;,\u0026#34;b\u0026#34;],\u0026#34;columntypes\u0026#34;:[\u0026#34;integer\u0026#34;,\u0026#34;character varying(3000)\u0026#34;],\u0026#34;columnvalues\u0026#34;:[2,\u0026#34;zxcv\u0026#34;],\u0026#34;oldkeys\u0026#34;:{\u0026#34;keynames\u0026#34;:[\u0026#34;a\u0026#34;],. |.\u0026#34;keytypes\u0026#34;:[\u0026#34;integer\u0026#34;],\u0026#34;keyvalues\u0026#34;:[2]}}]} =\u0026gt; update test1 set b=\u0026#39;zxcv\u0026#39; where a=1; UPDATE 1 Time: 1.391 ms =\u0026gt; select * from pg_logical_slot_peek_changes(\u0026#39;logical_json\u0026#39;,null,null); -[ RECORD 1 ]------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ lsn | 349/A87CACF8 xid | 872483495 data | {\u0026#34;change\u0026#34;:[{\u0026#34;kind\u0026#34;:\u0026#34;update\u0026#34;,\u0026#34;schema\u0026#34;:\u0026#34;public\u0026#34;,\u0026#34;table\u0026#34;:\u0026#34;test1\u0026#34;,\u0026#34;columnnames\u0026#34;:[\u0026#34;a\u0026#34;,\u0026#34;b\u0026#34;],\u0026#34;columntypes\u0026#34;:[\u0026#34;integer\u0026#34;,\u0026#34;character varying(3000)\u0026#34;],\u0026#34;columnvalues\u0026#34;:[2,\u0026#34;zxcv\u0026#34;],\u0026#34;oldkeys\u0026#34;:{\u0026#34;keynames\u0026#34;:[\u0026#34;a\u0026#34;],. 
|.\u0026#34;keytypes\u0026#34;:[\u0026#34;integer\u0026#34;],\u0026#34;keyvalues\u0026#34;:[2]}}]} -[ RECORD 2 ]------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ lsn | 349/A8CCA0D8 xid | 872483509 data | {\u0026#34;change\u0026#34;:[{\u0026#34;kind\u0026#34;:\u0026#34;update\u0026#34;,\u0026#34;schema\u0026#34;:\u0026#34;public\u0026#34;,\u0026#34;table\u0026#34;:\u0026#34;test1\u0026#34;,\u0026#34;columnnames\u0026#34;:[\u0026#34;a\u0026#34;,\u0026#34;b\u0026#34;,\u0026#34;c\u0026#34;],\u0026#34;columntypes\u0026#34;:[\u0026#34;integer\u0026#34;,\u0026#34;character varying(3000)\u0026#34;,\u0026#34;character varying(3000)\u0026#34;],\u0026#34;columnvalues\u0026#34;:[1,\u0026#34;zxcv\u0026#34;. |.,\u0026#34;qwer\u0026#34;],\u0026#34;oldkeys\u0026#34;:{\u0026#34;keynames\u0026#34;:[\u0026#34;a\u0026#34;],\u0026#34;keytypes\u0026#34;:[\u0026#34;integer\u0026#34;],\u0026#34;keyvalues\u0026#34;:[1]}}]} --When updating, column c data is not decoded wal2json shows the same behavior.\nMySQL\u0026rsquo;s binlog_row_image parameter can adjust whether binlog records large fields:\nfull (Log all columns) minimal (Log only changed columns, and columns needed to identify rows) noblob (Log all columns, except for unneeded BLOB and TEXT columns) PG has absolutely no such control — by default, TOAST columns are not decoded, and there are no other options to configure~\n","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/2024/08/12/some-features-of-postgresql-logical-replication/","section":"Posts","summary":"I’ve already written a fairly detailed article about logical replication before, so I won’t repeat the basics here. However, some knowledge points inevitably get missed. 
Recently I’ve discovered some interesting logical replication features.\nreplica identity and old/new values # replica identity is used to identify a row during logical replication. The above statement is certainly correct, but it doesn’t explain the changes in old and new data.\nDEFAULT Records the old values of the columns of the primary key, if any. This is the default for non-system tables. USING INDEX index_name Records the old values of the columns covered by the named index, that must be unique, not partial, not deferrable, and include only columns marked NOT NULL. If this index is dropped, the behavior is the same as NOTHING. FULL Records the old values of all columns in the row. NOTHING Records no information about the old row. This is the default for system tables.\n","title":"Some Features of PostgreSQL Logical Replication","type":"posts"},{"content":" Problem: The Queried Table Did Not Appear in the Execution Plan # SQL:\nSELECT * FROM ( SELECT A.column1 as \u0026#34;column1\u0026#34;, -- many A columns omitted in between A.column99 as \u0026#34;column99\u0026#34; from table_a A left join ( SELECT lzl_id from table_a AA inner join table_b BB ON AA.lzl_key = BB.lzl_id where AA.column_code = \u0026#39;1\u0026#39; GROUP BY lzl_id ) B ON B.lzl_id = A.lzl_key where A.flagflagflag = \u0026#39;1\u0026#39; AND A.typetypetype = \u0026#39;2\u0026#39; ) TEMP limit 100 offset 1000 Execution plan:\nQUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------- Limit (cost=2.84..5.68 rows=1 width=1105) (actual time=0.038..0.039 rows=0 loops=1) Buffers: shared hit=2 -\u0026gt; Seq Scan on table_a a (cost=0.00..2.84 rows=1 width=1105) (actual time=0.036..0.037 rows=0 loops=1) Filter: (((flagflagflag)::text = \u0026#39;1\u0026#39;::text) AND ((typetypetype)::text = \u0026#39;2\u0026#39;::text)) Rows Removed by Filter: 38 Buffers: shared hit=2 Planning Time: 0.184 ms Execution Time: 0.066 ms 
As you can see, the SQL itself is fairly complex. Logically, the SQL references three table instances (table_a twice and table_b once), that is, two distinct tables. I can understand table_a appearing in the execution plan, but table_b, which needed to be queried, wasn\u0026rsquo;t in the execution plan at all! The execution plan was simply a sequential scan of table_a.\nThe Analytical Journey # During the analysis I considered many possibilities, but the most likely one was logical optimization: the PostgreSQL optimizer determined that table_b didn\u0026rsquo;t need to be queried.\nObserving the SQL, I noticed that the final query only selected columns from table_a, without any columns from table_b. Adding any column from the intermediate table B made the SQL execution plan appear \u0026ldquo;normal\u0026rdquo;, i.e. it accessed table_b:\nexplain SELECT * FROM ( SELECT A.column1 as \u0026#34;column1\u0026#34;, -- many A columns omitted in between A.column99 as \u0026#34;column99\u0026#34;, B.lzl_id -- added a column from intermediate table B from table_a A left join ( SELECT lzl_id from table_a AA inner join table_b BB ON AA.lzl_key = BB.lzl_id where AA.column_code = \u0026#39;1\u0026#39; GROUP BY lzl_id ) B ON B.lzl_id = A.lzl_key where A.flagflagflag = \u0026#39;1\u0026#39; AND A.typetypetype = \u0026#39;2\u0026#39; ) TEMP limit 100 offset 1000 --------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=14.69..17.67 rows=1 width=1113) -\u0026gt; Nested Loop Left Join (cost=11.72..14.69 rows=1 width=1113) Join Filter: (bb.lzl_id = a.lzl_key) -\u0026gt; Seq Scan on table_a a (cost=0.00..2.84 rows=1 width=1113) Filter: (((flagflagflag)::text = \u0026#39;1\u0026#39;::text) AND ((typetypetype)::text = \u0026#39;2\u0026#39;::text)) -\u0026gt; Group (cost=11.72..11.74 rows=5 width=8) Group Key: bb.lzl_id -\u0026gt; Sort (cost=11.72..11.73 rows=5 width=8) Sort Key: bb.lzl_id -\u0026gt; 
Nested Loop (cost=0.15..11.66 rows=5 width=8) -\u0026gt; Seq Scan on table_a aa (cost=0.00..2.70 rows=1 width=8) Filter: ((company_code)::text = \u0026#39;1\u0026#39;::text) -\u0026gt; Index Only Scan using idx_table_b_lzl_id on table_b bb (cost=0.15..8.83 rows=13 width=8) Index Cond: (lzl_id = aa.lzl_key) This seems related to LEFT JOIN, but a quick thought makes it seem incorrect — after all, the results from the right table should affect the final query result, so the right table shouldn\u0026rsquo;t be skipped. Let\u0026rsquo;s try a simple LEFT JOIN:\nexplain select lzlleft.a from lzlleft left join lzlright on lzlleft.a=lzlright.a; QUERY PLAN -------------------------------------------------------------------- Hash Left Join (cost=1.04..15.47 rows=320 width=4) Hash Cond: (lzlleft.a = lzlright.a) -\u0026gt; Seq Scan on lzlleft (cost=0.00..13.20 rows=320 width=4) -\u0026gt; Hash (cost=1.02..1.02 rows=2 width=4) -\u0026gt; Seq Scan on lzlright (cost=0.00..1.02 rows=2 width=4) The right table is scanned. But, in intermediate table B, there\u0026rsquo;s the keyword GROUP BY. 
If we remove GROUP BY, then table_b is accessed regardless of whether we query columns from B.\nLet\u0026rsquo;s add a GROUP BY in our test table and see the result:\n\u0026gt; select * from lzlleft; a | b ---+----- 1 | zzz (1 row) Time: 0.259 ms \u0026gt; select * from lzlright; a | b ---+------- 1 | qwer 1 | poiuy \u0026gt; select lzlright.b from lzlleft full join lzlright on lzlleft.b=lzlright.b group by lzlright.b; b -------- [null] poiuy qwer (3 rows) This is where I realized that the result set from GROUP BY must have a certain property — uniqueness.\nLet\u0026rsquo;s add GROUP BY in the test table:\nexplain select lzlleft.a from lzlleft left join (select a from lzlright group by a) c on lzlleft.a=c.a; QUERY PLAN ---------------------------------------------------------- Seq Scan on lzlleft (cost=0.00..13.20 rows=320 width=4) The right table is not queried!\nBased on the principle of right-table uniqueness, we can also have some fun variations:\n-- distinct ensures right-table uniqueness \u0026gt; explain select lzlleft.a from lzlleft left join (select distinct a from lzlright) c on lzlleft.a=c.a; QUERY PLAN ---------------------------------------------------------- Seq Scan on lzlleft (cost=0.00..13.20 rows=320 width=4) -- unique index ensures right-table uniqueness, even with just select a from lzlright \u0026gt; explain select lzlleft.a from lzlleft left join (select a from lzlright) c on lzlleft.a=c.a; QUERY PLAN ----------------------------------------------------------------------- Hash Left Join (cost=17.20..49.12 rows=512 width=4) Hash Cond: (lzlleft.a = lzlright.a) -\u0026gt; Seq Scan on lzlleft (cost=0.00..13.20 rows=320 width=4) -\u0026gt; Hash (cost=13.20..13.20 rows=320 width=4) -\u0026gt; Seq Scan on lzlright (cost=0.00..13.20 rows=320 width=4) (5 rows) Time: 0.510 ms \u0026gt; create unique index idx_right on lzlright(a); CREATE INDEX Time: 3.576 ms \u0026gt; explain select lzlleft.a from lzlleft left join (select a from lzlright) c on 
lzlleft.a=c.a; QUERY PLAN ---------------------------------------------------------- Seq Scan on lzlleft (cost=0.00..13.20 rows=320 width=4) (1 row) Here\u0026rsquo;s a summary of the analysis: when the right table\u0026rsquo;s data is unique and only the left table\u0026rsquo;s data is being queried, there\u0026rsquo;s no need to actually access the right table. So this is not a bug, but a feature of the PostgreSQL optimizer — and it makes logical sense.\nSource Code Analysis # No source code analysis this time~\nThe optimizer source code is just too difficult. I only looked at some optimizer source code comments. Search for the keyword unique-ify, and you\u0026rsquo;ll find this:\n* Also, this routine and others in this module accept the special JoinTypes * JOIN_UNIQUE_OUTER and JOIN_UNIQUE_INNER to indicate that we should * unique-ify the outer or inner relation and then apply a regular inner * join. These values are not allowed to propagate outside this module, * however. Path cost estimation code may need to recognize that it\u0026#39;s * dealing with such a case --- the combination of nominal jointype INNER * with sjinfo-\u0026gt;jointype == JOIN_SEMI indicates that. Special JoinTypes: JOIN_UNIQUE_INNER and JOIN_UNIQUE_OUTER — they try to unique-ify the outer and inner relations and then treat them as an inner join. 
Path cost estimation needs to consider this scenario.\nComparison with Oracle and MySQL Optimizers # Let\u0026rsquo;s compare whether Oracle and MySQL optimizers have similar logical optimization improvements.\n-- Oracle create table lzlleft(a number); create table lzlright(a number); select lzlleft.a from lzlleft left join (select distinct a from lzlright) c on lzlleft.a=c.a; -- GROUP BY uniqueness SQL\u0026gt; select lzlleft.a from lzlleft left join (select a from lzlright group by a) c on lzlleft.a=c.a; no rows selected Execution Plan ---------------------------------------------------------- Plan hash value: 3533354041 --------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | --------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 26 | 5 (20)| 00:00:01 | |* 1 | HASH JOIN OUTER | | 1 | 26 | 5 (20)| 00:00:01 | | 2 | TABLE ACCESS FULL | LZLLEFT | 1 | 13 | 2 (0)| 00:00:01 | | 3 | VIEW | | 1 | 13 | 3 (34)| 00:00:01 | | 4 | HASH GROUP BY | | 1 | 13 | 3 (34)| 00:00:01 | | 5 | TABLE ACCESS FULL| LZLRIGHT | 1 | 13 | 2 (0)| 00:00:01 | --------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - access(\u0026#34;LZLLEFT\u0026#34;.\u0026#34;A\u0026#34;=\u0026#34;C\u0026#34;.\u0026#34;A\u0026#34;(+)) -- DISTINCT uniqueness SQL\u0026gt; select lzlleft.a from lzlleft left join (select distinct a from lzlright) c on lzlleft.a=c.a; no rows selected Execution Plan ---------------------------------------------------------- Plan hash value: 3859658234 --------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | --------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 26 | 5 (20)| 
00:00:01 | |* 1 | HASH JOIN OUTER | | 1 | 26 | 5 (20)| 00:00:01 | | 2 | TABLE ACCESS FULL | LZLLEFT | 1 | 13 | 2 (0)| 00:00:01 | | 3 | VIEW | | 1 | 13 | 3 (34)| 00:00:01 | | 4 | HASH UNIQUE | | 1 | 13 | 3 (34)| 00:00:01 | | 5 | TABLE ACCESS FULL| LZLRIGHT | 1 | 13 | 2 (0)| 00:00:01 | --------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - access(\u0026#34;LZLLEFT\u0026#34;.\u0026#34;A\u0026#34;=\u0026#34;C\u0026#34;.\u0026#34;A\u0026#34;(+)) -- MySQL create table lzlleft(a int primary key); create table lzlright(a int primary key); -- GROUP BY uniqueness explain select lzlleft.a from lzlleft left join (select a from lzlright group by a) c on lzlleft.a=c.a; +----+-------------+------------+------------+-------+---------------+-------------+---------+-----------------+------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+------------+------------+-------+---------------+-------------+---------+-----------------+------+----------+-------------+ | 1 | PRIMARY | lzlleft | NULL | index | NULL | PRIMARY | 4 | NULL | 1 | 100.00 | Using index | | 1 | PRIMARY | \u0026lt;derived2\u0026gt; | NULL | ref | \u0026lt;auto_key0\u0026gt; | \u0026lt;auto_key0\u0026gt; | 4 | lzldb.lzlleft.a | 2 | 100.00 | Using index | | 2 | DERIVED | lzlright | NULL | index | PRIMARY | PRIMARY | 4 | NULL | 1 | 100.00 | Using index | +----+-------------+------------+------------+-------+---------------+-------------+---------+-----------------+------+----------+-------------+ -- DISTINCT uniqueness explain select lzlleft.a from lzlleft left join (select distinct a from lzlright) c on lzlleft.a=c.a; +----+-------------+------------+------------+-------+---------------+-------------+---------+-----------------+------+----------+-------------+ | id | select_type 
| table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+------------+------------+-------+---------------+-------------+---------+-----------------+------+----------+-------------+ | 1 | PRIMARY | lzlleft | NULL | index | NULL | PRIMARY | 4 | NULL | 1 | 100.00 | Using index | | 1 | PRIMARY | \u0026lt;derived2\u0026gt; | NULL | ref | \u0026lt;auto_key0\u0026gt; | \u0026lt;auto_key0\u0026gt; | 4 | lzldb.lzlleft.a | 2 | 100.00 | Using index | | 2 | DERIVED | lzlright | NULL | index | PRIMARY | PRIMARY | 4 | NULL | 1 | 100.00 | Using index | +----+-------------+------------+------------+-------+---------------+-------------+---------+-----------------+------+----------+-------------+ In summary, neither Oracle nor MySQL performs the optimization of eliminating the right table in a LEFT JOIN when only left-table columns are queried and the right table is unique — they both access the right table.\nThe PostgreSQL optimizer really has some impressive tricks.\n","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/2024/08/12/the-table-i-wanted-to-query-was-not-in-the-execution-plan/","section":"Posts","summary":"Problem: The Queried Table Did Not Appear in the Execution Plan # SQL:\nSELECT * FROM ( SELECT A.column1 as \"column1\", -- many A columns omitted in between A.column99 as \"column99\" from table_a A left join ( SELECT lzl_id from table_a AA inner join table_b BB ON AA.lzl_key = BB.lzl_id where AA.column_code = '1' GROUP BY lzl_id ) B ON B.lzl_id = A.lzl_key where A.flagflagflag = '1' AND A.typetypetype = '2' ) TEMP limit 100 offset 1000 Execution plan:\n","title":"The Table I Wanted to Query Was Not in the Execution Plan","type":"posts"},{"content":" Problem Description # PostgreSQL UPDATE statement throws error: too many range table entries\nOriginal SQL:\nwith t as (select id from LZLTAB where id=8723 limit 100 ) update\tLZLTAB set STATUS = \u0026#39;00\u0026#39;, FILE_ID = null, DATE_UPDATED = 
localtimestamp(0) where id in (select\tid from t) If we rewrite UPDATE as SELECT, it succeeds:\nwith t as (select\tid from\tLZLTAB where\tid=8723 limit 100 ) select * from LZLTAB where\tid in (select id\tfrom t) id | date_created ------+----------------------------+... 8723 | 2023-06-21 18:02:21.161687 (1 row)\tPrimary key and partitions (400 partitions total):\nPartition key: RANGE (partition_key) Indexes: \u0026#34;pk_lzl\u0026#34; PRIMARY KEY, btree (id, partition_key) ... Partitions: lzl_p20230601 FOR VALUES FROM (\u0026#39;20230601\u0026#39;) TO (\u0026#39;20230602\u0026#39;), lzl_p20230602 FOR VALUES FROM (\u0026#39;20230602\u0026#39;) TO (\u0026#39;20230603\u0026#39;), lzl_p20230603 FOR VALUES FROM (\u0026#39;20230603\u0026#39;) TO (\u0026#39;20230604\u0026#39;) The SQL logic has many optimization opportunities, but we won\u0026rsquo;t discuss those here. The focus is on why UPDATE fails and why SELECT and UPDATE behave differently.\nEXPLAIN UPDATE throws this error:\nexplain with t as (select id from LZLTAB where id=8723 limit 100 ) update LZLTAB set STATUS = \u0026#39;00\u0026#39;, FILE_ID = null, DATE_UPDATED = localtimestamp(0) where id in (select id from t); ERROR: 54000: too many range table entries LOCATION: add_rte_to_flat_rtable, setrefs.c:451 Time: 18341.171 ms (00:18.341) EXPLAIN took 18 seconds, then threw the error.\nSource Code Analysis # The error directly points to the source location: LOCATION: add_rte_to_flat_rtable, setrefs.c:451\nFind the source at src/backend/optimizer/plan/setrefs.c.\nThe comment explains that setrefs.c handles post-processing of a completed plan tree:\n/* *Post-processing of a completed plan tree: fix references to subplan *\tvars, compute regproc values for operators, etc */ Find the function at line 451:\n/* * Add (a copy of) the given RTE to the final rangetable * * In the flat rangetable, we zero out substructure pointers that are not * needed by the executor; this reduces the storage space and copying cost * for 
cached plans. We keep only the ctename, alias and eref Alias fields, * which are needed by EXPLAIN, and the selectedCols, insertedCols, * updatedCols, and extraUpdatedCols bitmaps, which are needed for * executor-startup permissions checking and for trigger event checking. */ static void add_rte_to_flat_rtable(PlannerGlobal *glob, RangeTblEntry *rte) { ... /* * Check for RT index overflow; it\u0026#39;s very unlikely, but if it did happen, * the executor would get confused by varnos that match the special varno * values. */ if (IS_SPECIAL_VARNO(list_length(glob-\u0026gt;finalrtable))) ereport(ERROR, (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), errmsg(\u0026#34;too many range table entries\u0026#34;))); ... } errmsg() is at line 451. From the comments, add_rte_to_flat_rtable() is related to RTE. What is RTE? We\u0026rsquo;ll analyze below.\nThe error check uses IS_SPECIAL_VARNO(). Searching for this macro in src/include/nodes/primnodes.h:\n/* * Var - expression node representing a variable (ie, a table column) * * In the parser and planner, varno and varattno identify the semantic * referent, which is a base-relation column unless the reference is to a join * USING column that isn\u0026#39;t semantically equivalent to either join input column * (because it is a FULL join or the input column requires a type coercion). * In those cases varno and varattno refer to the JOIN RTE. (Early in the * planner, we replace such join references by the implied expression; but up * till then we want join reference Vars to keep their original identity for * query-printing purposes.) ... */ #define INNER_VAR\t65000\t/* reference to inner subplan */ #define OUTER_VAR\t65001\t/* reference to outer subplan */ #define INDEX_VAR\t65002\t/* reference to index column */ #define IS_SPECIAL_VARNO(varno)\t((varno) \u0026gt;= INNER_VAR) The comment above is a bit dense, but one phrase is key: In those cases varno and varattno refer to the JOIN RTE. 
varno is related to RTE.\nWhen varno\u0026gt;=65000, the error is thrown. (We won\u0026rsquo;t go into the differences between INNER_VAR, OUTER_VAR, and INDEX_VAR here since their values are close and don\u0026rsquo;t affect the analysis.)\nWhat is RTE?\nDescriptions of RTE (rangetable or RangeTblEntry) can be found throughout the execution plan source code, and the error is clear: ERROR: 54000: too many range table entries — it\u0026rsquo;s about RTE. So what is RTE?\nIn src/include/nodes/parsenodes.h, there\u0026rsquo;s a description of RTE:\n/*-------------------- * RangeTblEntry - *\tA range table is a List of RangeTblEntry nodes. * *\tA range table entry may represent a plain relation, a sub-select in *\tFROM, or the result of a JOIN clause. (Only explicit JOIN syntax *\tproduces an RTE, not the implicit join resulting from multiple FROM *\titems. This is because we only need the RTE to deal with SQL features *\tlike outer joins and join-output-column aliasing.) Other special *\tRTE types also exist, as indicated by RTEKind. * *\tNote that we consider RTE_RELATION to cover anything that has a pg_class *\tentry. relkind distinguishes the sub-cases. */ Simply put, an RTE is a \u0026ldquo;table\u0026rdquo; in the execution plan — it can be a concrete table or a generated \u0026ldquo;table\u0026rdquo; like a subquery, join result, etc. The RTE limit of 65000 means too many RTEs were generated in the execution plan.\nViewing the UPDATE Execution Plan # Since we now know what RTE is, looking at the SQL execution plan may help. 
But since the original SQL (400 partitions) couldn\u0026rsquo;t generate an execution plan, let\u0026rsquo;s create a 30-partition table and hopefully EXPLAIN it to observe the plan.\n30-partition table with the same UPDATE statement:\nexplain with t as (select id from lzl where id=8723 limit 100 ) update lzl set STATUS = \u0026#39;00\u0026#39;, FILE_ID = null, DATE_UPDATED = localtimestamp(0) where id in ( select id from t); Generated execution plan:\nQUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------- Update on lzl (cost=155.48..4980.00 rows=600 width=3042) Update on lzl_p20230601 lzl_1 Update on lzl_p20230602 lzl_2 ... Update on lzl_p20230630 lzl_30 -\u0026gt; Hash Semi Join (cost=155.48..166.00 rows=20 width=3042) Hash Cond: (lzl_1.id = t.id) -\u0026gt; Seq Scan on lzl_p20230601 lzl_1 (cost=0.00..10.20 rows=20 width=2912) -\u0026gt; Hash (cost=155.10..155.10 rows=30 width=40) -\u0026gt; Subquery Scan on t (cost=0.14..155.10 rows=30 width=40) -\u0026gt; Limit (cost=0.14..154.80 rows=30 width=8) -\u0026gt; Append (cost=0.14..154.80 rows=30 width=8) -\u0026gt; Index Only Scan using lzl_p20230601_pkey on lzl_p20230601 lzl_32 (cost=0.14..5.16 rows=1 width=8) Index Cond: (id = 8723) -\u0026gt; Index Only Scan using lzl_p20230602_pkey on lzl_p20230602 lzl_33 (cost=0.14..5.16 rows=1 width=8) Index Cond: (id = 8723) ... -\u0026gt; Index Only Scan using lzl_p20230630_pkey on lzl_p20230630 lzl_61 (cost=0.14..5.16 rows=1 width=8) Index Cond: (id = 8723) ... 
-\u0026gt; Hash Semi Join (cost=155.48..166.00 rows=20 width=3042) Hash Cond: (lzl_30.id = t_29.id) -\u0026gt; Seq Scan on lzl_p20230630 lzl_30 (cost=0.00..10.20 rows=20 width=2912) -\u0026gt; Hash (cost=155.10..155.10 rows=30 width=40) -\u0026gt; Subquery Scan on t_29 (cost=0.14..155.10 rows=30 width=40) -\u0026gt; Limit (cost=0.14..154.80 rows=30 width=8) -\u0026gt; Append (cost=0.14..154.80 rows=30 width=8) -\u0026gt; Index Only Scan using lzl_p20230601_pkey on lzl_p20230601 lzl_931 (cost=0.14..5.16 rows=1 width=8) Index Cond: (id = 8723) -\u0026gt; Index Only Scan using lzl_p20230602_pkey on lzl_p20230602 lzl_932 (cost=0.14..5.16 rows=1 width=8) Index Cond: (id = 8723) ... -\u0026gt; Index Only Scan using lzl_p20230630_pkey on lzl_p20230630 lzl_960 (cost=0.14..5.16 rows=1 width=8) Index Cond: (id = 8723) (2041 rows) The execution plan is extremely long: 2041 rows in total. This plan is very inefficient: every time a partition is updated, the predicate conditions are run against the whole partitioned table all over again. Since the SQL lacks a partition key condition, each run scans all partitions. For a 30-partition table, each partition is scanned 30 times, totaling 900 partition scans.\nFrom the execution plan, we can see that 30 RTEs were first allocated for the UPDATE targets, lzl_1 through lzl_30. Then the hash side of each per-partition join allocated another 30 RTEs; for example, the hash under lzl_1 scans partitions lzl_32 through lzl_61. Why 32 instead of 31? Because the entire partition scan is wrapped in a subquery, which is also an RTE, named t (t, t_1 through t_29), totaling 30. 
So the total RTEs generated in the plan are 30+30+30×30=960.\nLooking at the SELECT execution plan, it\u0026rsquo;s very different from UPDATE:\nexplain with t as (select id from lzl where id=8723 limit 100 ) select STATUS ,FILE_ID ,DATE_UPDATED from lzl where id in ( select id from t); Hash Semi Join (cost=155.48..467.05 rows=90 width=98) Hash Cond: (lzl.id = lzl_31.id) -\u0026gt; Append (cost=0.00..309.00 rows=600 width=106) -\u0026gt; Seq Scan on lzl_p20230601 lzl_1 (cost=0.00..10.20 rows=20 width=106) -\u0026gt; Seq Scan on lzl_p20230602 lzl_2 (cost=0.00..10.20 rows=20 width=106) ... -\u0026gt; Seq Scan on lzl_p20230630 lzl_30 (cost=0.00..10.20 rows=20 width=106) -\u0026gt; Hash (cost=155.10..155.10 rows=30 width=8) -\u0026gt; Limit (cost=0.14..154.80 rows=30 width=8) -\u0026gt; Append (cost=0.14..154.80 rows=30 width=8) -\u0026gt; Index Only Scan using lzl_p20230601_pkey on lzl_p20230601 lzl_32 (cost=0.14..5.16 rows=1 width=8) Index Cond: (id = 8723) -\u0026gt; Index Only Scan using lzl_p20230602_pkey on lzl_p20230602 lzl_33 (cost=0.14..5.16 rows=1 width=8) Index Cond: (id = 8723) ... -\u0026gt; Index Only Scan using lzl_p20230630_pkey on lzl_p20230630 lzl_61 (cost=0.14..5.16 rows=1 width=8) Index Cond: (id = 8723) (96 rows) There is no repeated (Cartesian product-style) table access, and RTEs only go up to 61. This is why SELECT succeeds on 400 partitions while UPDATE does not: the 400×400 accesses in the UPDATE plan are simply too many.\nSo regarding the original SQL where UPDATE fails and SELECT succeeds, we can conclude:\nFor 400 partitions with SELECT, the execution plan has 801 RTEs, which doesn\u0026rsquo;t exceed INNER_VAR (65000), so it can generate a plan and execute. For 400 partitions with UPDATE, the execution plan has 400+400+400×400 = 160,800 RTEs, far exceeding INNER_VAR (65000), so the plan cannot be generated and throws the RTE overflow error. The cause is mostly analyzed, but the significant difference between SELECT and UPDATE plans is still puzzling. 
Let\u0026rsquo;s compare the Oracle and MySQL execution plans side by side.\nOracle Behavior # Oracle partitioned table with local index:\nCREATE TABLE lzl ( id number NOT NULL, partition_key number DEFAULT 0 NOT NULL, ... ) PARTITION BY RANGE (partition_key) ( PARTITION lzl_p20230601 VALUES LESS THAN (\u0026#39;20230602\u0026#39;), PARTITION lzl_p20230602 VALUES LESS THAN (\u0026#39;20230603\u0026#39;), ... PARTITION lzl_p20230630 VALUES LESS THAN (\u0026#39;20230631\u0026#39;)); create index PKLZL on lzl(id, partition_key) local; alter table lzl add constraint pklzl primary key (id, partition_key) using index pklzl; Execution plan:\nwith t as (select id from lzl where id=8723 and rownum\u0026lt;= 100 ) select STATUS ,FILE_ID ,DATE_UPDATED from lzl where id in ( select id from t) update lzl set STATUS = \u0026#39;00\u0026#39;, FILE_ID = null, DATE_UPDATED = sysdate where id in (select id from lzl where id=8723 and rownum\u0026lt;= 100) In Oracle, both SELECT and UPDATE use NESTED LOOP, accessing all partitions (PARTITION RANGE ALL). So in Oracle, regardless of SELECT or UPDATE, table t is the driving table. Because of IN, results are sorted and deduplicated. So Oracle\u0026rsquo;s plan is not 30×30 accesses but depends on the result set size in the driving table: n rows means n×30 partition accesses. Since driving table t has minimal data, this plan is fine.\nMySQL Behavior # Since MySQL only supports local indexes, just create the primary key directly:\nCREATE TABLE lzl ( id bigint NOT NULL, date_created timestamp , ... ) PARTITION BY RANGE (partition_key) ( PARTITION lzl_p20230601 VALUES LESS THAN (20230602), PARTITION lzl_p20230602 VALUES LESS THAN (20230603), ... 
PARTITION lzl_p20230630 VALUES LESS THAN (20230631)); alter table lzl add primary key pklzl(id,partition_key); MySQL starting from 5.7 shows which partitions are scanned in the execution plan (version 8.0 here).\nSELECT plan:\n\u0026gt; explain with t as (select id from lzl where id=8723 limit 100 ) -\u0026gt; select STATUS ,FILE_ID ,DATE_UPDATED from lzl where id in ( select id from t); +----+-------------+------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+---------------+---------+---------+-------+------+----------+-----------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+---------------+---------+---------+-------+------+----------+-----------------+ | 1 | PRIMARY | \u0026lt;derived3\u0026gt; | NULL | ALL | NULL | NULL | NULL | NULL | 2 | 100.00 | Start temporary | | 1 | PRIMARY | lzl | 
lzl_p20230601,lzl_p20230602,lzl_p20230603,lzl_p20230604,lzl_p20230605,lzl_p20230606,lzl_p20230607,lzl_p20230608,lzl_p20230609,lzl_p20230610,lzl_p20230611,lzl_p20230612,lzl_p20230613,lzl_p20230614,lzl_p20230615,lzl_p20230616,lzl_p20230617,lzl_p20230618,lzl_p20230619,lzl_p20230620,lzl_p20230621,lzl_p20230622,lzl_p20230623,lzl_p20230624,lzl_p20230625,lzl_p20230626,lzl_p20230627,lzl_p20230628,lzl_p20230629,lzl_p20230630 | ref | PRIMARY | PRIMARY | 8 | t.id | 1 | 100.00 | End temporary | | 3 | DERIVED | lzl | lzl_p20230601,lzl_p20230602,lzl_p20230603,lzl_p20230604,lzl_p20230605,lzl_p20230606,lzl_p20230607,lzl_p20230608,lzl_p20230609,lzl_p20230610,lzl_p20230611,lzl_p20230612,lzl_p20230613,lzl_p20230614,lzl_p20230615,lzl_p20230616,lzl_p20230617,lzl_p20230618,lzl_p20230619,lzl_p20230620,lzl_p20230621,lzl_p20230622,lzl_p20230623,lzl_p20230624,lzl_p20230625,lzl_p20230626,lzl_p20230627,lzl_p20230628,lzl_p20230629,lzl_p20230630 | ref | PRIMARY | PRIMARY | 8 | const | 1 | 100.00 | Using index | UPDATE plan:\n\u0026gt; explain with t as (select id from lzl where id=8723 limit 100 ) -\u0026gt; update lzl set -\u0026gt; STATUS = \u0026#39;00\u0026#39;, -\u0026gt; FILE_ID = null, -\u0026gt; DATE_UPDATED = localtimestamp(0) where id in ( select id from t); +----+-------------+------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+---------------+---------+---------+-------+------+----------+-----------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | 
+----+-------------+------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+---------------+---------+---------+-------+------+----------+-----------------+ | 1 | PRIMARY | \u0026lt;derived3\u0026gt; | NULL | ALL | NULL | NULL | NULL | NULL | 2 | 100.00 | Start temporary | | 1 | UPDATE | lzl | lzl_p20230601,lzl_p20230602,lzl_p20230603,lzl_p20230604,lzl_p20230605,lzl_p20230606,lzl_p20230607,lzl_p20230608,lzl_p20230609,lzl_p20230610,lzl_p20230611,lzl_p20230612,lzl_p20230613,lzl_p20230614,lzl_p20230615,lzl_p20230616,lzl_p20230617,lzl_p20230618,lzl_p20230619,lzl_p20230620,lzl_p20230621,lzl_p20230622,lzl_p20230623,lzl_p20230624,lzl_p20230625,lzl_p20230626,lzl_p20230627,lzl_p20230628,lzl_p20230629,lzl_p20230630 | ref | PRIMARY | PRIMARY | 8 | t.id | 1 | 100.00 | End temporary | | 3 | DERIVED | lzl | lzl_p20230601,lzl_p20230602,lzl_p20230603,lzl_p20230604,lzl_p20230605,lzl_p20230606,lzl_p20230607,lzl_p20230608,lzl_p20230609,lzl_p20230610,lzl_p20230611,lzl_p20230612,lzl_p20230613,lzl_p20230614,lzl_p20230615,lzl_p20230616,lzl_p20230617,lzl_p20230618,lzl_p20230619,lzl_p20230620,lzl_p20230621,lzl_p20230622,lzl_p20230623,lzl_p20230624,lzl_p20230625,lzl_p20230626,lzl_p20230627,lzl_p20230628,lzl_p20230629,lzl_p20230630 | ref | PRIMARY | PRIMARY | 8 | const | 1 | 100.00 | Using index | MySQL\u0026rsquo;s two execution plans are identical. However, the driving table selection could be better — const should be the driving table to reduce scan count.\nBug? # Bug Description # https://postgrespro.com/list/thread-id/2482006\nThis bug is easy to find via the error. 
It was submitted by digoal (德哥) back in 2020, followed by discussion between two source code experts. The discussion is lengthy, but to summarize: PG does not support unlimited partitions, which is understandable in the real world — too many partitions can cause rapid performance degradation. However, the community still felt the limit needed adjustment and discussed the INNER_VAR, Var.varno values in the source code.\nMisleading Nature # The bug title is somewhat misleading: BUG #16302: too many range table entries - when count partition table(65538 childs)\nThe bug seems to say the number of partition child tables can\u0026rsquo;t exceed 65,538. The discussion also mentions PG can handle up to 64K relations in a query — a query cannot have more than 64K relations.\nThis is odd because our table has 400 partitions and still throws the error. In fact, both descriptions above are not entirely accurate. The 64K limit refers to the \u0026ldquo;tables\u0026rdquo; in the execution plan, which doesn\u0026rsquo;t exactly equal real tables. Of course, if tables or partitions exceed this count, there will be problems. But even without exceeding 64K, issues can arise, as in our case with only 400 partitions.\nFix # The bug was submitted for version 12.2; our environment is 13.2.\nThis bug is fixed in PG15. 
The source in src/include/nodes/primnodes.h is different:\n#define INNER_VAR\t(-1)\t/* reference to inner subplan */ #define OUTER_VAR\t(-2)\t/* reference to outer subplan */ #define INDEX_VAR\t(-3)\t/* reference to index column */ #define ROWID_VAR\t(-4)\t/* row identity column during planning */ #define IS_SPECIAL_VARNO(varno)\t((int) (varno) \u0026lt; 0) As discussed in the community, PG15 not only changed VAR values to negative numbers but also converted varno to 32-bit (4 billion), compared to the previous 16-bit (65,536).\nAnd in the function that previously threw the error, add_rte_to_flat_rtable() in src/backend/optimizer/plan/setrefs.c, the error code has been completely removed! The entire PG15 source code no longer contains too many range table entries!\nSummary # PG still has room for improvement in partitioned table optimization. PG treats child partitions as regular tables, unlike Oracle and MySQL. Oracle treats child partitions as segments distinct from tables. This causes PG to output the access method for every partition in the execution plan (when pruning doesn\u0026rsquo;t occur), making plans extremely long when there are many partitions. Oracle just writes PARTITION RANGE ALL. MySQL also prints all partitions but doesn\u0026rsquo;t treat each partition\u0026rsquo;s access as a subquery, reducing plan complexity. Even when partitions haven\u0026rsquo;t reached 64K, you can still get too many range table entries. This limit is actually on execution plan RTE count, not partition count (though if partition count reaches this number, RTE count will too, as mentioned — PG prints access methods for all partitions). The too many range table entries error is resolved in PG15. For versions below 15, don\u0026rsquo;t create too many partitions! You can also leverage partition pruning to reduce accessed partitions — in this case, simply adding a partition key condition to the WHERE clause would work. 
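The RTE arithmetic from the plan analysis can be sketched in a few lines of Python. This is only an illustration under the approximate counting used in this post (n UPDATE targets, n subquery RTEs and n×n partition scans for UPDATE; 2n+1 for SELECT). The function names are made up here, and the real planner may add a few more entries.

```python
# Special varno threshold before PG15 (INNER_VAR in primnodes.h).
INNER_VAR = 65000


def update_rtes(partitions):
    # Approximate count used in this post: one RTE per UPDATE target,
    # one per copied subquery t, plus a full set of partition scans per target.
    return partitions + partitions + partitions * partitions


def select_rtes(partitions):
    # Highest RTE alias number seen in the SELECT plan
    # (61 for 30 partitions, 801 for 400 partitions).
    return 2 * partitions + 1


def exceeds_limit(rte_count):
    # Mirrors IS_SPECIAL_VARNO(varno): (varno) >= INNER_VAR.
    return rte_count >= INNER_VAR


for n in (30, 400):
    print(n, update_rtes(n), exceeds_limit(update_rtes(n)),
          select_rtes(n), exceeds_limit(select_rtes(n)))
```

Under this approximation, the UPDATE form would already cross the threshold at roughly 254 partitions, far below 64K, while the SELECT form stays well under it at 400.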
","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/2024/08/12/too-many-range-table-entries-even-with-not-that-many-partitions/","section":"Posts","summary":"Problem Description # PostgreSQL UPDATE statement throws error: too many range table entries\nOriginal SQL:\nwith t as (select id from LZLTAB where id=8723 limit 100 ) update\tLZLTAB set STATUS = '00', FILE_ID = null, DATE_UPDATED = localtimestamp(0) where id in (select\tid from t) If we rewrite UPDATE as SELECT, it succeeds:\nwith t as (select\tid from\tLZLTAB where\tid=8723 limit 100 ) select * from LZLTAB where\tid in (select id\tfrom t) id | date_created ------+----------------------------+... 8723 | 2023-06-21 18:02:21.161687 (1 row)\tPrimary key and partitions — 400 partitions total:\n","title":"Too Many Range Table Entries Even with Not-That-Many Partitions","type":"posts"},{"content":" Vector Database Core Concepts # A Bit of History # The development history of LLM models, from Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond1:\nMany people only gradually learned about large models after the ChatGPT explosion, but in the years before that tipping point, the development of large models had already begun a war of the gods. Several institutions published many revolutionary papers — on the corporate side: Google, DeepMind, OpenAI, Meta, Microsoft; on the academic side: Stanford, Berkeley, CMU, Princeton, MIT2.\nThere are three main camps:\nGoogle \u0026amp; DeepMind camp — Gemini, Bard Microsoft \u0026amp; OpenAI camp — ChatGPT, Bing Meta open-source community camp — Llama Timeline of recent large model product releases, from A Survey of Large Language Models3:\nGenerative AI Basics # AIGC (Artificial Intelligence Generated Content): The precise concept of AIGC is a mode of production that uses AI to automatically generate content. 
In a broader sense, AIGC can be approximated as AI technology trained to possess human-like generative and creative capabilities — i.e., Generative AI. It can autonomously generate and create new text, images, music, videos, 3D interactive content, and various other forms of content and data based on data and generative algorithm models, and even includes enabling new scientific discoveries and creating new meanings.\nLLM (Large Language Model): LLMs are large language models capable of capturing and processing complex language patterns and semantics — that is, they can understand and generate human language. GPT-3, ChatGPT, BERT, T5, ERNIE Bot, and others are typical large language models.\nNLP (Natural Language Processing): Natural Language Processing (NLP) studies how to enable computers to read and understand human language — i.e., converting natural human language into instructions that computers can process. LLM is an important component of NLP.\nAIGC has achieved remarkable growth, largely due to Natural Language Processing (NLP), and the biggest driver behind NLP\u0026rsquo;s progress is the Large Language Model (LLM). This year (2024), AIGC is also developing rapidly in areas such as video and audio.4\nprompt: Instructions or directives — natural language provided to AI describing a task, used to guide a language model (such as GPT-3 or GPT-4) to generate the corresponding output5. (Everyone basically knows what this is already, no need to elaborate.)\nembedding:\nEmbedding is a method of representing objects (such as text, images, and audio) as points in a continuous vector space, where the positions of these points in space carry semantic meaning for machine learning algorithms.\nBased on GloVe word-vector relevance for English words, there is an interactive 2D embedding explorer. 
This shows natural language embedded as 2D vectors:\nRAG # RAG (Retrieval-Augmented Generation) is a two-stage process consisting of document retrieval and large language model (LLM) answer generation. The initial stage leverages dense embeddings to retrieve documents. Depending on the specific use case, this retrieval can be based on various database formats, such as vector databases, summary indexes, tree indexes, and key indexes5.\nThe original RAG paper6 was published on May 22, 2020, by researchers from Facebook (Meta), University College London, and New York University, proposing a general fine-tuning approach for RAG. RAG includes the following characteristics2:\nRAG models combine pre-trained memory to assist language generation RAG models generate language that is more specific, diverse, and factual On March 23, 2023, OpenAI released the chatgpt-retrieval-plugin repository, recommending the use of vector databases in RAG. From that point on, vector databases gained widespread attention in the application domain, riding the wave of large model popularity.\nWhat Can Vector Databases Bring to AI? # Vector databases can provide large models with data retrieval and long-term data storage capabilities within RAG7.\nWhy use RAG? No words carry more weight than those of the master, OpenAI. The following passage is from the retrieval plugin usage guide released by OpenAI in March 20238, translated by ChatGPT:\nThe open-source retrieval plugin enables ChatGPT to access personal or organizational information sources (with permission). Users can ask questions or express needs in natural language and obtain the most relevant document snippets from their data sources (such as files, notes, emails, or public documents).\nAs an open-source and self-hosted solution, developers can deploy their own version of the plugin and register it with ChatGPT. 
The plugin leverages OpenAI\u0026rsquo;s embeddings and allows developers to choose a vector database (such as Milvus, Pinecone, Qdrant, Redis, Weaviate, or Zilliz) to index and search documents. Information sources can be synchronized with the database using webhooks.\nIn short, OpenAI recommends everyone use vector databases.\nHas the vector database cooled off? Not only has it not cooled off — RAG has developed to the point of being everywhere today — Has RAG Technology Really Become \u0026ldquo;Commonplace\u0026rdquo;?. And vector databases, with their high retrieval efficiency, data storage reliability, and other characteristics, are an important part of RAG.\nCommon Vector Databases # Since OpenAI released the RAG repo, many vector databases have emerged (though some existed before). Several companies have also secured considerable funding9:\nCompany Headquartered in Funding Weaviate 🇳🇱 Amsterdam $68M Series B Qdrant 🇩🇪 Berlin $11M Seed Pinecone 🇺🇸 San Francisco $138M Series B Milvus/Zilliz 🇨🇳 / 🇺🇸 Redwood City $113M Series B Chroma 🇺🇸 San Francisco $20M Seed LanceDB 🇺🇸 San Francisco Venture Vespa 🇳🇴 / 🇺🇸 Indianapolis Yahoo! Vald 🇯🇵 Tokyo Yahoo! Japan Vector database release timeline:\nVector database performance comparison10:\nDedicated vector databases generally perform better than traditional databases with vector plugins, for roughly two reasons:\nDedicated vector databases are built with vector-specific underlying storage, and their performance is generally better than untargeted traditional databases. Dedicated vector databases are generally newer (mostly implemented in Go or Rust), making code-level optimization easier. However, this does not mean plugin-based vector databases have no place:\nTraditional databases natively support more features, not just similarity computation. ACID — traditional database storage is safer. It\u0026rsquo;s easier to manipulate data within a single database. 
Vector database feature comparison:\nThe description of pgvector above is no longer entirely accurate — pgvector now supports HNSW, and the pgvector ecosystem project pgvectorscale also supports DiskANN.\nMathematical Concepts # Mathematics says: \u0026ldquo;I stand on the mountaintop watching you all play.\u0026rdquo;\nScalar # A scalar is a specific number. Scalars have no direction and are generally defined in contrast to vectors.\nVector # In Euclidean space, a vector has both magnitude and direction. For example, vector a from point A to point B (contains information about both points and direction)11:\nUnit Vector # A vector with magnitude one is a unit vector. The unit vector equals the vector divided by its Euclidean length12: $$ \\vec a = \\frac{\\mathbf a}{||\\mathbf a||} $$\nIn mathematics, the Unit Vector is called a \u0026ldquo;normalized vector\u0026rdquo; in pgvector and OpenAI embeddings. (Note: do not confuse this with the mathematical concept of the normal vector — a normal vector is a different concept entirely.)\nWhy use unit vectors?\nOpenAI embeddings\u0026rsquo; explanation for using unit vectors13:\nOpenAI embeddings are normalized to length 1, which means that:\nCosine similarity can be computed slightly faster using just a dot product Cosine similarity and Euclidean distance will result in the identical rankings Sparse Vector # Sparse vectors are called \u0026ldquo;sparse\u0026rdquo; because the information in the vector is sparsely distributed. Typically, we need to find a few ones (relevant information) among thousands of zeros. Therefore, these vectors can contain many dimensions, usually in the tens of thousands.\nComparison of sparse and dense vectors: Sparse vectors contain sparsely distributed bits of information, while dense vectors carry more information in every dimension — information-dense.14\nEuclidean Space # Simply called Euclidean space, it is the most fundamental space in mathematics. 
In modern mathematics, a Euclidean space can have any positive integer number of dimensions n.\nThere are other space definitions, such as inner product space and Hilbert space. They differ in mathematical definitions, but in database/real-world contexts, the distinctions are not so fine-grained. The key takeaway is that inner product space, Euclidean space, and Hilbert space can all contain elements such as points, vectors, and inner products, so we can simply call them \u0026ldquo;multi-dimensional spaces\u0026rdquo;. For their differences, see A Casual Discussion of Various Spaces in Mathematics15.\nEuclidean Distance # Simply called Euclidean distance, this is what we generally think of as the distance between points, i.e., the length of the line segment connecting them16.\nIn 2D space, the Euclidean distance between points q and p is: $$ d(\\mathbf p,\\mathbf q)=\\sqrt{(p_1-q_1)^2+(p_2-q_2)^2} $$\nIn n-dimensional space, the Euclidean distance between points q and p is: $$ d(\\mathbf p,\\mathbf q)=\\sqrt{(p_1-q_1)^2+(p_2-q_2)^2+\\cdots+(p_n-q_n)^2} $$\nManhattan Distance (or Taxicab Distance) # $$ d(\\mathbf p,\\mathbf q)= \\sum_{i=1}^n | p_i-q_i| $$\nManhattan distance is the sum of the absolute differences of two points across each dimension17.\nIn the figure above, the green line is Euclidean distance; the red, yellow, and blue lines are Manhattan distances.\nMinkowski Distance # $$ d(\\mathbf a,\\mathbf b)= \\left( \\sum_{i=1}^n | a_i-b_i|^p \\right)^{1/p} $$\nThe figure below shows the distance from the origin to a point of unit length at different values of p in Minkowski distance18:\nWhen p=1, it is Manhattan distance, also written as \u0026ldquo;L1 distance\u0026rdquo; When p=2, it is Euclidean distance, also written as \u0026ldquo;L2 distance\u0026rdquo; For a general order p, it is the Minkowski distance, also written as \u0026ldquo;Lp distance\u0026rdquo; Cosine Similarity # The cosine of the angle between two vectors is also called cosine similarity. 
Cosine similarity depends only on the angle between the two vectors, not on the vectors\u0026rsquo; lengths19.\nThe smaller the angle between two vectors, the larger the cosine similarity. Value range: [-1, 1]. cos(0°)=1, cos(90°)=0, cos(180°)=-1.\nCosine similarity between two vectors is written as: $$ \\cos (\\theta) $$ Expressed in vector form: $$ \\cos (\\theta)=\\frac{\\mathbf a\\cdot \\mathbf b }{||\\mathbf a|| \\, ||\\mathbf b||}= \\frac{ \\sum_{i=1}^n a_i b_i}{ \\sqrt {\\sum_{i=1}^n a_i ^2} \\cdot \\sqrt {\\sum_{i=1}^n b_i ^2}} $$\nInner Product # Also called the dot product, it can be used to represent the length and angle of vectors. The inner product equals the product of the two vectors\u0026rsquo; Euclidean lengths (norms) multiplied by the cosine of the angle between them.\nInner product in 2D space: $$ \\mathbf a\\cdot \\mathbf b=||\\mathbf a|| \\, ||\\mathbf b|| \\, \\cos \\theta $$ or $$ \\mathbf a\\cdot \\mathbf b= a_1 b_1 + a_2 b_2 $$ Inner product in n-dimensional space (a=[a1,a2,···,an], b=[b1,b2,···,bn]): $$ \\mathbf a\\cdot \\mathbf b=\\sum_{i=1}^n a_ib_i= a_1b_1 + a_2b_2 + \\cdots + a_nb_n $$\nNow the following diagram should make sense. 
Using the formulas above, you can also reverse-engineer what the distance operators mean for n-dimensional vectors.\nThey are: Euclidean distance, cosine distance, and inner product20.\nAll three can describe the similarity between two vectors.\nEuclidean distance: contains only distance information between the two vectors Cosine distance: contains only angle information between the two vectors Inner product: contains both length (magnitude) information and angle information Of course, there are more mathematical models for vector similarity computation, but it depends on whether the vector database supports them.\nJaccard Distance # In short: intersection divided by union21.\nFormula: $$ J(A,B)= \\frac{|A\\cap B| }{|A \\cup B|} $$\nStrictly speaking, this ratio is the Jaccard similarity (Jaccard index); the Jaccard distance is 1 - J(A,B).\nExpressed in binary vectors, it is the number of positions where both vectors are 1 divided by the number of positions where at least one vector is 122.\nHamming Distance # The number of differing positions between two strings or vectors of equal length23.\nExamples:\n\u0026ldquo;karolin\u0026rdquo; and \u0026ldquo;kathrin\u0026rdquo; is 3. \u0026ldquo;karolin\u0026rdquo; and \u0026ldquo;kerstin\u0026rdquo; is 3. \u0026ldquo;kathrin\u0026rdquo; and \u0026ldquo;kerstin\u0026rdquo; is 4. 0000 and 1111 is 4. 2173896 and 2233796 is 3. Illustration24:\nDelaunay Triangulation # Delaunay triangulation is an operation on a set of points in a plane. It subdivides the convex hull of these points (which contains multiple points) into multiple triangles, where the circumcircle of each triangle contains no point from the set in its interior. 
This maximizes the minimum angle among all triangles and tends to avoid producing skinny triangles25.\nDoes NOT satisfy \u0026ldquo;the circumcircle of each triangle contains no point from the set\u0026rdquo;:\nDOES satisfy \u0026ldquo;the circumcircle of each triangle contains no point from the set\u0026rdquo;:\nFor example, triangulating a point set:\nA valid triangulation:\nDelaunay triangulation is not actually an algorithm; it merely defines what a \u0026ldquo;good\u0026rdquo; triangular mesh looks like. Its excellent properties are the empty-circle property and the maximized-minimum-angle property. These two properties avoid the creation of skinny triangles and make Delaunay triangulation widely applicable.\nVoronoi Diagram # Delaunay triangulation is a triangulation of a discrete point set P in general position, and it corresponds to the dual graph of P\u0026rsquo;s Voronoi diagram. The circumcenters of Delaunay triangles are the vertices of the Voronoi diagram. In 2D, Voronoi vertices are connected by edges, which can be derived from the adjacency relationships of Delaunay triangles: if two triangles share an edge in the Delaunay triangulation, their circumcenters should be connected by an edge in the Voronoi tessellation26:\nThe key property of a Voronoi diagram is: the distance from any point within a region to that region\u0026rsquo;s centroid is no greater than the distance from that point to any other centroid. $$ R_k=\\{x \\in X \\,|\\, d(x,P_k) \\le d(x,P_j) \\; \\mathrm{for\\,all}\\, j \\neq k\\} $$ Rk is the region (cell) belonging to centroid Pk, d(x,Pk) is the distance from a point x in that region to its own centroid, and d(x,Pj) is the distance from x to any other centroid Pj.\nDue to different ways of computing the distance d, Voronoi diagrams can take on different appearances27:\nVector Database Indexes # Nearest Neighbor Search # ENN (Exact Nearest Neighbor): Finding the point or vector closest to a query point in a given dataset. 
This method guarantees the highest accuracy, but as the dataset size increases, the computational cost rises sharply because it requires evaluating the distance between the query point and every point in the dataset.\nANN (Approximate Nearest Neighbor): To improve efficiency, approximately finding the nearest point to the query point at the cost of some accuracy. This method is implemented through various algorithms and can significantly reduce computational cost, especially effective when dealing with large-scale datasets.\nKNN (K-Nearest Neighbors): A commonly used machine learning algorithm that works by finding the K nearest neighbors to a given query point in the dataset.\nIndex Evaluation Criteria # Evaluating the quality of an index always depends on the specific data model, but in general, it includes the following points:\nQuery time: Query speed is critical, especially important in large model contexts. Query quality: ANN queries won\u0026rsquo;t always return perfectly accurate results, but the query quality must not deviate too much. Query quality has many metrics, the most common being recall. Memory consumption: The memory consumed by the query index — searching in memory is clearly faster than searching on disk. Training time: Some search methods require training to reach a good state. Write time: The impact on the index when writing vectors, including all maintenance. Most of these metrics are straightforward. Here we\u0026rsquo;ll focus on query quality:\nIn ANN search, results are not always exact. 
When searching a set of elements, the concepts include: the query scope (retrieved elements), all correct elements (relevant elements), the returned correct elements (true positives), and the returned incorrect elements (false positives)28:\nTP = True positive; FP = False positive; TN = True negative; FN = False negative\nAccuracy: $$ Accuracy=\\frac{TP+TN}{TP+FP+TN+FN} $$ or: $$ Accuracy=\\frac{\\text{correctly classified elements}}{\\text{all elements}} $$\nPrecision: $$ Precision=\\frac{TP}{TP+FP} $$ or: $$ Precision=\\frac{\\text{retrieved correct elements}}{\\text{all retrieved elements}} $$\nRecall: $$ Recall=\\frac{TP}{TP+FN} $$ or: $$ Recall=\\frac{\\text{retrieved correct elements}}{\\text{all correct elements}} $$\nF-measure: The harmonic mean of precision and recall $$ F=2 \\cdot \\frac{precision \\cdot recall}{precision+recall} $$\nExample: Consider a computer program designed to identify dogs (and related elements) in digital photos. When processing a photo containing ten cats and twelve dogs, the program identifies eight dogs. Among the eight identified as dogs, only five are actually dogs (true positives), while the other three are cats (false positives). Seven dogs were missed (false negatives), and seven cats were correctly excluded (true negatives). For this program:\nAccuracy = (5+7)/(10+12) = 12/22 (true positives plus true negatives / all elements) Precision = 5/8 (true positives / all retrieved elements) Recall = 5/12 (true positives / all correct elements) F-measure = 2*[(5/8)*(5/12)]/[(5/8)+(5/12)] = 0.5 Locality-Sensitive Hashing (LSH) # LSH is a method for narrowing the search scope by converting data vectors into hash values while preserving information about their similarity.\nLSH Construction # LSH has many implementations. Here we introduce the more traditional one. This traditional LSH implementation consists of three parts22:\nShingling: Encode the original text into vectors. 
MinHashing: Convert the vectors into a special representation called a signature, used for comparing similarity between them. LSH function: Hash the signatures into different buckets. If a pair of vectors\u0026rsquo; signatures fall into the same bucket at least once, they are considered candidates. Shingling # Shingling is a method of embedding (in my personal opinion). Shingling identifies natural language as k consecutive tokens, with duplicate tokens removed22:\nAt this point, we have a set of tokens based on k-grams. The next step is to convert them into vectors.\nStart with an all-zero vector, whose length equals the length of the token set. Set the position corresponding to each token to 1:\nThe final result is a very long vector containing only 0s and 1s, where the vector\u0026rsquo;s information captures the semantics of a sentence.\nMinHashing # Since the vector dimensionality is extremely high, directly computing approximate distances using one-hot encoded vectors yields very poor results. We need to convert sparse vectors into dense vectors — this process is called MinHashing in LSH, and the converted vector is called a MinHashing signature.\nMinHashing can be a bit tricky for beginners at first, but once you grasp it, you\u0026rsquo;ll find it very simple.\nMinHashing is a hash function that permutes the components of an input vector and then returns the first index where the permuted vector component equals 1.\nFirst, apply a permutation: rearrange the components of a vector. Return the index of the first element that equals 1 after permutation. For example:\nu1 vector (0,0,1,1,0): after the first random permutation, the corresponding index is 0; after the second random permutation, the corresponding index is 029. u1\u0026rsquo;s MinHashing signature is (0,0).\nIn practice, multiple minhash values can be used to approximately compute the Jaccard similarity between vectors. 
The more minhash values used, the more accurate the approximation.\nLSH Function # Even after converting sparse vectors into dense vectors, the dense vectors can still have high dimensionality, making direct retrieval inefficient.\nWe can improve query efficiency using hash tables. However, note that using a completely random hash algorithm easily places nearby vectors into different hash buckets. We need a hash algorithm that places nearby vectors into the same hash bucket — this is LSH: Locality-Sensitive Hashing.\nThe LSH mechanism builds a hash table consisting of several parts which puts a pair of signatures into the same bucket if they have at least one corresponding part.\nThe concept of locality-sensitive hashing is also simple: split the signature into bands, compute hash values for each sub-signature band, and designate those with colliding sub-hash values as candidates.\nThe following example is easy to understand — read through it:\nThinking in terms of extremes: b=1 means no banding at all — direct hashing, completely defeating the purpose of LSH. b=number of signature elements means one band per element, i.e., one hash value per element — this can achieve relatively accurate approximate comparison, but it imposes a massive burden on computation and memory.\nLSH Parameters and Error Rate # The probability that a vector becomes a candidate vector directly affects recall. 
The probability P that a pair of vectors becomes a candidate pair is $$ P = 1-(1-s^r)^b $$ where:\ns represents similarity b represents the number of bands r represents the number of rows per band If we plot P against s using the formula, the relationship between vector similarity and candidate probability is as follows:\nFor a fixed signature length, the larger the number of bands b (and so the fewer rows r per band), the lower the similarity at which a pair is likely to become a candidate.\nAt the same time, adjusting b and r affects P, and P in turn determines how many false positives (FP) and false negatives (FN) are produced.\nFor example, returning more candidates naturally leads to more false positives, i.e., returning non-similar \u0026ldquo;candidate pairs.\u0026rdquo; This is an inevitable consequence of modifying the parameter b.\nTP = True positive; FP = False positive; TN = True negative; FN = False negative\nLSH is sensitive to high dimensionality: more dimensions require longer signatures and more computation to maintain good search quality. In such cases, other indexes are recommended.\nMore # There are two more articles I haven\u0026rsquo;t finished digesting. They seem to be related to binary vectors and Euclidean distance:\nhttps://towardsdatascience.com/similarity-search-part-6-random-projections-with-lsh-forest-f2e9b31dcc47\nhttps://towardsdatascience.com/similarity-search-part-7-lsh-compositions-1b2ae8239aca\nHNSW Index # The HNSW algorithm (Hierarchical Navigable Small World) is a multi-layer graph-based proximity algorithm. HNSW is currently one of the most popular vector index algorithms.\nAt a high level, HNSW is based on the Small World Theory. The Small World Theory originally stems from the Six Degrees of Separation theory in social psychology: any two people can be connected through at most five intermediaries. In other words, any two people on Earth can be connected through at most six steps of social connections. The Small World Theory was later widely accepted through experimental and empirical evidence and extended to non-social relationship networks. 
Note that the Small World Theory describes an observed phenomenon.\nIn short, the Small World Theory explains that \u0026ldquo;the connection between two entities is actually very short.\u0026rdquo; What HNSW does is establish connections between elements and reduce the number of connections.\nHNSW Index Construction # Let\u0026rsquo;s look at the HNSW paper\u0026rsquo;s algorithm for constructing HNSW graph layers30:\nSeveral elements in the construction algorithm are important:\nM is the number of new edges (connections) added, representing the number of new edges for a newly inserted node. Mmax is the maximum number of edges per node. If neighboring nodes are inserted continuously, the edge count of existing neighboring nodes could keep increasing, wasting computational resources during search. When inserting a new node causes an existing neighboring node\u0026rsquo;s edge count to exceed Mmax, shrink connection is needed. efConstruction is the size of the candidate neighbor list kept while searching for a new node\u0026rsquo;s neighbors during construction. Construction illustration31:\nSteps for HNSW node insertion (without shrink connection):\nWhen a new node is inserted, first find neighboring nodes at the top layer (in the paper, layers above the new node\u0026rsquo;s insertion layer are searched with ef=1, and efConstruction applies from the insertion layer downward). Use the found nearest neighbor as the entry point to descend to the next layer, then continue searching for neighbors using that layer\u0026rsquo;s efConstruction candidate list. Perform node insertion at a certain layer (e.g., L=2). Select M nodes from the efConstruction candidates and connect them to the new node. At this point, 1 new node is added with M edges connected to it. Repeat step 2 until reaching the bottom layer (layer0). 
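The insertion steps above can be sketched in code. What follows is a minimal single-layer sketch in Python, not the paper's full algorithm: it assumes one layer only, skips the heuristic neighbor selection, keeps edges undirected, and always enters the graph at node 0. The names `dist`, `search_layer`, `insert`, `m_max` and `ef_construction` are hypothetical and chosen only for illustration. It shows how the candidate list of size ef, the M new edges, and the Mmax shrink step interact.

```python
import heapq
import math


def dist(a, b):
    # Euclidean (L2) distance between two equal-length vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search_layer(vectors, graph, query, entry, ef):
    # Best-first search on one layer; returns up to ef node ids, nearest first.
    visited = {entry}
    d0 = dist(query, vectors[entry])
    candidates = [(d0, entry)]  # min-heap: closest unexplored node on top
    results = [(-d0, entry)]    # max-heap via negation: farthest kept node on top
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -results[0][0]:
            break  # every remaining candidate is farther than the worst kept result
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(query, vectors[nb])
            if len(results) < ef or dn < -results[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(results, (-dn, nb))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the farthest kept node
    return [n for _, n in sorted((-d, n) for d, n in results)]


def insert(vectors, graph, new_id, vec, entry, m, m_max, ef_construction):
    # Simplified insertion: find candidates, link the m nearest, shrink if needed.
    vectors[new_id] = vec
    graph[new_id] = set()
    if entry is None:
        return  # first node: nothing to connect to
    nearest = search_layer(vectors, graph, vec, entry, ef_construction)
    for nb in nearest[:m]:
        graph[new_id].add(nb)
        graph[nb].add(new_id)
        if len(graph[nb]) > m_max:
            # Shrink connection: keep only the m_max closest neighbors of nb.
            keep = sorted(graph[nb], key=lambda x: dist(vectors[nb], vectors[x]))[:m_max]
            for dropped in graph[nb] - set(keep):
                graph[dropped].discard(nb)
            graph[nb] = set(keep)


# Toy data: five 2D points, with node 0 always used as the entry point.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
vectors, graph = {}, {}
for i, p in enumerate(points):
    insert(vectors, graph, i, p, 0 if i else None, m=2, m_max=3, ef_construction=4)

nearest = search_layer(vectors, graph, (2.2, 0.0), 0, 3)
print(nearest)
```

In this toy run, inserting the far-away point (10, 0) pushes node 2 past `m_max`, so the shrink step drops that long edge again. This is a small-scale version of the connectivity concern that the heuristic neighbor selection in the paper is meant to address.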
HNSW Heuristic Neighbor Selection # The basic HNSW index structure construction has another problem: if two clusters are relatively far apart, according to the basic HNSW construction algorithm, the two clusters are almost impossible to connect, because the basic HNSW construction algorithm is built on the nearest neighbor nodes in efConstruction.\nThe HNSW original paper not only proposed the basic HNSW construction algorithm but also introduced a heuristic algorithm for solving the isolated cluster problem:\nFig.2 Heuristic for selecting graph neighbors for two isolated clusters. A new element is inserted on the boundary of cluster 1. All the element\u0026rsquo;s nearest neighbors belong to cluster 1, thus missing the Delaunay triangulation edges between the clusters. However, the heuristic selects element e2 from cluster 2, so if the inserted element is closer to e2 than to any other element from cluster 1, global connectivity is maintained.\n\u0026ldquo;The heuristic algorithm not only considers the nearest distance between nodes in the graph but also considers connectivity between different regions of the graph.\u0026rdquo;\nAs shown below, when adding node X, the heuristic algorithm should be applied here — establishing connectivity with cluster A, rather than simply adding to the nearest neighbor nodes:\nHNSW Index Search # The main logic of HNSW\u0026rsquo;s KNN search method as described in the HNSW original paper consists of the following two algorithms:\nAlgorithm 2 appears slightly more complex, but the logic is actually simple — Algorithm 2 finds the set of nearest neighbor nodes ef for q at that layer. In simple terms, Algorithm 2 adds candidate nodes to the ef set, compares distances, and removes the farthest nodes, so the returned W is the ef for q at that layer. Algorithm 5 returns the K nearest neighbor nodes of q. It calls Algorithm 2 twice (or more). 
The first line in the for loop has input parameter ef=1, meaning non-bottom layers only find the single nearest ep (entry point). The bottom layer (lc=0) returns the K nearest neighbor node set W. HNSW Complexity # The number of HNSW layers grows on the order of log(N).\nSearch complexity: Complexity can be rigorously evaluated in a Delaunay graph, with the average complexity being O(log(N)) (for non-Delaunay graphs, such as graphs with heuristic neighbor selection, the paper does not provide a specific complexity formula).\nConstruction complexity: HNSW is constructed by iteratively inserting all elements, with average complexity O(N∙log(N)).\nHNSW Index Parameters # Generally, HNSW indexes for vector data have several adjustable parameters that affect index construction speed, recall, etc. Different databases may have slightly different parameters. Here we use pgvector\u0026rsquo;s HNSW parameters as an example:\nIndex construction parameters:\nm: Maximum number of edges per vector, default 16. Equivalent to Mmax in the paper. ef_construction: Size of the candidate neighbor list during index construction, default 64. Equivalent to efConstruction in the paper. Index search parameters:\nhnsw.ef_search: Adjusts the size of the candidate neighbor list during search, default 40 (equivalent to ef in the paper\u0026rsquo;s search algorithm). It should be greater than or equal to the query\u0026rsquo;s limit, because a search cannot return more rows than the candidate list holds. Impact of adjusting ef_construction on creation time and recall during index construction20:\nIncreasing ef_construction improves recall but extends index creation time. After ef_construction=256, index construction time increases noticeably but recall improvement is not obvious.\nIncreasing m also improves recall and extends index creation time. After m=36, index construction time increases noticeably but recall improvement is not obvious.\nSimilarly, increasing hnsw.ef_search improves recall at the cost of performance.\nIVFFlat Index # IVFFlat stands for Inverted File Flat: an inverted-file index whose vectors are stored flat, that is, without compression. 
(What\u0026rsquo;s the relationship with \u0026ldquo;invert\u0026rdquo;? Do all indexes that can\u0026rsquo;t be categorized get called inverted?) The name comes from the inverted file (inverted list) structure: each centroid maps to the list of vectors assigned to it. The core concept of the IVFFlat index is based on the Voronoi diagram:\nThe key property of a Voronoi diagram is: the distance from any point within a region to that region\u0026rsquo;s centroid is no greater than the distance from that point to any other centroid.\nThis property is expressed in formula form: $$ R_k=\\{x \\in X \\,|\\, d(x,P_k) \\le d(x,P_j) \\; \\mathrm{for\\,all}\\, j \\neq k\\} $$ Rk is the region (cell) belonging to centroid Pk, d(x,Pk) is the distance from a point x in that region to its own centroid, and d(x,Pj) is the distance from x to any other centroid Pj.\nUsing this concept, we can partition many vectors into regions by setting centroids, and then use the Voronoi diagram property to roughly find nearby points.\nIVFFlat Index Construction # Let\u0026rsquo;s reduce high-dimensional space to 2D for understanding IVFFlat index construction32.\nFor example, the following set of X marks represents points (or vectors). Suppose we have three centroids:\nThe three centroids partition the space into 3 Voronoi cells, and all points are assigned to their respective Voronoi cells:\nIVFFlat Index Search # Now there is a query node. Compute its distance to all centroids, find the nearest centroid, and the cell containing that centroid is the region to search next. Finally, within that region, find the neighboring nodes33:\nBoundary Problem:\nThe above search path has a boundary problem. 
When the query is near a region boundary, if the true nearest node is in another region, the algorithm of \u0026ldquo;only searching for neighboring nodes within the region\u0026rdquo; will not find the true nearest neighbor.\nThe boundary problem is fundamentally because:\nThe Voronoi diagram only guarantees that the distance from a node to its own region\u0026rsquo;s centroid is smaller than the distance to other centroids, but it does NOT guarantee that the distance from a node to other nodes in its own region is smaller than the distance to nodes in other regions.\nThis problem can be mitigated by increasing the number of regions searched. For example, increasing the number of regions searched from 1 to 3:\nIncreasing the number of search regions is generally set as a parameter in databases, such as ivfflat.probes in pgvector.\nIVFFlat Search Summary:\nCompute the distance from the query node to all other centroids, find the nearest one. Based on the input parameter for the number of cells to query (e.g., probes), search for neighboring points in the top probes cells. IVFFlat Index Parameters # Similarly, vector databases that support IVFFlat indexes generally have at least two parameters: list and probe. These parameters affect index search performance and recall. Here we use Faiss parameters as an example32.\nnlist: Number of regions to construct. Increasing nlist increases the time to search for the nearest centroid but reduces the time to search for nodes within a region. nprobe: Number of regions to search. Increasing nprobe increases the number of regions searched, which obviously reduces search performance but improves recall. Theoretically, for nlist, it\u0026rsquo;s best to test specifically against the structure of the vector data and the database type — increasing nlist does not always reduce response time. 
For nprobe, increasing nprobe definitely reduces search performance and improves recall, but making nprobe too large is meaningless and goes against the original intent of ANN.\nThe following is from Pinecone\u0026rsquo;s performance testing of the Faiss IVFFlat index:\nPQ Product Quantization # One million dense vectors may require gigabytes of memory, and real-world vectors far exceed this number. Without management, similarity vector search can require enormous amounts of memory — yet RAM is limited. Vector size increases with vector dimensionality and the number of vectors.\nProduct Quantization (PQ) aims to reduce memory usage and can also improve query speed (because the amount of computation is reduced). PQ is a lossy compression method, which leads to reduced vector retrieval accuracy, but this is acceptable within ANN requirements.\nPQ\u0026rsquo;s algorithm logic is slightly more complex than other algorithms. I strongly recommend this article: Similarity Search, Part 2: Product Quantization34.\nPQ Construction # Step description:\nSubvectors — Split the original high-dimensional vector into n low-dimensional sub-vectors. Codebook — Use the k-means algorithm (or other algorithms) to compute the Voronoi diagram for each set of all sub-vectors, producing n different Voronoi diagrams. These Voronoi diagrams are the codebooks (assuming each Voronoi diagram has k centroids). Clustering — Place the n sub-vectors into their respective already-clustered Voronoi diagrams and compute the nearest centroid. Quantized vectors — Take these n nearest centroids as the new vector — the quantized vector. Reproduction values — Take the nearest centroid index for each of the n subspaces as new values; the combined new values are called the PQ code. Step 5, reproduction values, in detail:\nBased on the n sub-vectors and the k centroids in each subspace, we obtain an n×k centroid matrix. 
Taking the index of the nearest centroid for each sub-vector gives the PQ code.\n(btw: to be rigorous, all element indices in the diagram below should start from 1, not 0.)\nThe new PQ code is equivalent to a lossy-compressed new vector (reproduction value) of the original vector. New distance calculations can directly compute the L2 distance of the PQ codes.\nPQ Retrieval # Based on the PQ original paper35, there are two PQ retrieval modes:\nSymmetric mode: The distance between vector x and vector y is approximated by the distance between their centroids q(x) and q(y). In other words, the distance between two vectors can be approximated by the distance between their PQ codes. Asymmetric mode: The distance between vector x and vector y is approximated by the distance from x to the centroid q(y). In other words, the distance between two vectors can be computed using the original query vector value and the other vector\u0026rsquo;s PQ code. Clearly, the distance accuracy differs between the two modes:\nThe figure above shows the distance accuracy between two vectors under different modes, with 8 subspaces and 256 centroids. It can be seen that the asymmetric mode has higher accuracy than the symmetric mode.\nWhen comparing distances between two vectors, the symmetric and asymmetric distance computation models are quite useful. However, in the scenario of finding PQ approximate vectors, there are some differences — especially the symmetric mode, where distortion can be quite severe:\nThe symmetric mode\u0026rsquo;s query speed is very fast because the code table has already been computed and preserved during the PQ construction process. You only need to first compute the query vector x\u0026rsquo;s PQ code via the code table (minimal computation), then reverse-lookup the code table to get the corresponding sub-code-table — all vectors in this sub-code-table are approximate vectors at equal distance. 
This method requires extremely little computation — just a direct table lookup.\nThe symmetric mode\u0026rsquo;s distortion is relatively severe (the two figures above don\u0026rsquo;t fully capture it — imagine it as a Voronoi diagram where one cell contains multiple vectors, and you\u0026rsquo;ll realize how severe the symmetric distortion can be). The asymmetric mode can slightly alleviate this problem. In asymmetric mode, first compute the PQ code of vector x, then similarly reverse-lookup the code table to get the corresponding sub-code-table, then compute distances between vector x and the vectors in this sub-code-table to obtain KNN. Its computational cost is n×km (n = number of subspaces, km ≈ total vector count / centroid count).\nAsymmetric mode requires finding the centroid via the PQ code, then searching for KNN within the subspace where the centroid resides. The distance between the query vector x and an existing vector y is approximated by the distance between x and y\u0026rsquo;s centroid.\nPQ asymmetric retrieval34:\nSteps of PQ asymmetric retrieval:\nSplit the query vector into multiple sub-vectors. Compute the distance between sub-vectors and the centroid matrix. Take the nearest centroid in each subspace as the query vector\u0026rsquo;s PQ code. Compute the approximate distance using the query vector and the centroid corresponding to the PQ code. Distances can be computed independently in each subspace and then summed. As mentioned earlier, asymmetric mode\u0026rsquo;s approximate distance computation is slightly better than symmetric mode, but in some scenarios, the asymmetric distance can still deviate significantly from the actual distance:\nThis is easier to understand from the figure above. Within the same cell, the distance between the farthest vector and the centroid can differ significantly from the distance between the closest vector and the centroid. 
Computing only the partial distance to the centroid cannot capture this difference.\nPQ Parameters and Their Impact # PQ has at least two parameters that significantly affect performance and memory: the number of subspaces m and the number of centroids per subspace k*.\nRecall:\nThe product quantizer is parametrized by the number of subvectors m and the number of quantizers per subvector k*, producing a code of length m × log2 k*\nWith m subspaces, each having k* centroids, the length (in bits) of a PQ code is35: $$ \\text{code length (bits)}=m \\cdot \\log_2 k^* $$\nThe more subspaces m, the higher the recall; the longer the PQ code, the higher the recall. Longer PQ code essentially means more centroids. Note that the specific values here are based on the paper\u0026rsquo;s dataset.\nMemory and complexity:\nk represents the number of cluster centroids, D represents dimension, m represents the number of subspaces. k* represents centroids within a subspace, D* represents dimensions within a subspace, with k* = k^(1/m) and D* = D/m.\nFor example, with k=2048, D=128, m=8, the complexity is as follows36:\nOperation Memory and complexity k-means kD = 2048×128 = 262144 PQ m×k*×D* = (k^(1/m))×D = (2048^(1/8))×128 ≈ 332 It can be seen that PQ significantly reduces complexity during search.\nDiskANN \u0026amp; Vamana # The DiskANN original paper Abstract37:\nCurrent state-of-the-art approximate nearest neighbor search (ANNS) algorithms generate indices that must be stored in main memory for fast high-recall search. This makes them expensive and limits the size of the dataset. We present a new graph-based indexing and search system called DiskANN that can index, store, and search a billion point database on a single workstation with just 64GB RAM and an inexpensive solid-state drive (SSD).\nAt the time (the paper was published in 2019), state-of-the-art ANN algorithms all relied on RAM for high recall and performance. This approach was not only expensive but also limited dataset size. 
DiskANN requires only 64GB RAM and an affordable SSD.\nVamana Construction # Vamana iteratively builds a directed graph, starting from a random graph where each node represents a data point in the vector space. Initially, the graph is highly connected — all nodes are connected to each other. The graph is then optimized using an objective function that aims to maximize connectivity between the closest nodes. This is achieved by pruning most random short-range edges while adding certain long-range edges that connect distant nodes (to accelerate graph traversal)37.\nThe figure shows 200 2D points after two iterations. The first iteration aggressively prunes edges but also removes long edges that reduce path length; when alpha is increased to relax the pruning condition, long edges are added back38. For the specific algorithm, refer to the paper — this is roughly the idea.\nThe DiskANN Algorithm # From the paper\u0026rsquo;s \u0026ldquo;The DiskANN Index Design\u0026rdquo;:\nThe high-level idea is simple: run Vamana on a dataset P and store the resulting graph on an SSD. At search time, whenever Algorithm 1 requires the out-neighbors of a point p, we simply fetch this information from the SSD. However, note that just storing the vector data for a billion points in 100 dimensions would far exceed the RAM on a workstation! This raises two questions: how do we build a graph over a billion points, and how do we do distance comparisons between the query point and points in our candidate list at search time in Algorithm 1, if we cannot even store the vector data?\nRun Vamana on the vector set and store it on SSD. When the dataset is very large, two problems must be addressed:\nHow to index such a large-scale dataset with limited memory resources? k-means + Vamana stacking algorithm: First, use k-means to partition the data into k clusters, then assign each point to the nearest i clusters. Usually, i=2 is sufficient. 
Build an in-memory Vamana index for each cluster, and finally merge the k Vamana indexes into one.\nIf the original data cannot be loaded into memory, how to compute distances during search? Use compressed vectors (e.g., PQ) and store the compressed vectors in main memory.\nIf index data is stored on SSD, disk access count and disk read/write requests must be minimized to ensure low search latency; at the same time, lossy compression reduces recall. Therefore, the DiskANN paper proposes three optimization strategies:\nBeam Search: Simply put, preload neighbor information. When searching for point p, if p\u0026rsquo;s neighbors are not in memory, they must be loaded from disk. Since the time required for a small number of random SSD accesses is roughly the same as the time for a single SSD sector access, the neighbor information of W unvisited points can be loaded in one batch. W should not be set too large or too small. Setting W too large wastes computational resources and SSD bandwidth, while setting it too small increases search latency. Caching Frequently Visited Vertices: Aims to reduce disk access count. Cache all points within C hops from the starting point in memory. The value of C is best set between 3 and 4. Implicit Re-Ranking Using Full-Precision Vectors: Since PQ is lossy compression, PQ-based distance algorithms only approximate the actual distance. To eliminate this discrepancy, we store the distance from each point to all its neighbors — this is full-precision. As for the implementation principle, in simple terms, it also leverages disk loading efficiency. 
Based on the paper, DiskANN\u0026rsquo;s execution efficiency and recall outperform IVF and HNSW:\nReferences # Original article (Chinese): 向量数据库相关概念\nHarnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nChih-Hao Liu 66 Classic LLM Papers\u0026#160;\u0026#x21a9;\u0026#xfe0e;\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nA Survey of Large Language Models\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n一文讲清楚，AI、AGI、AIGC与AIGC、NLP、LLM，ChatGPT等概念\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nhttps://en.wikipedia.org/wiki/Prompt_engineering\u0026#160;\u0026#x21a9;\u0026#xfe0e;\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nRAG original paper\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nJonathan Katz pgconfdev2024 Vectors: How to better support a nasty data type\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nOpenAI recommends using vector databases https://openai.com/index/chatgpt-plugins/\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nVector databases (1): What makes each one different?\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nVector database performance comparison\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nhttps://en.wikipedia.org/wiki/Vector_(mathematics_and_physics)\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nhttps://en.wikipedia.org/wiki/Unit_vector\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nOpenAI on unit vector usage https://platform.openai.com/docs/guides/embeddings/frequently-asked-questions\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nPinecone Natural Language Processing for Semantic Search https://www.pinecone.io/learn/series/nlp/dense-vector-embeddings-nlp/\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nYao Yuan A Casual Discussion of Various Spaces in 
Mathematics\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nhttps://en.wikipedia.org/wiki/Euclidean_distance\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nhttps://en.wikipedia.org/wiki/Taxicab_geometry\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nhttps://en.wikipedia.org/wiki/Minkowski_distance\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nhttps://en.wikipedia.org/wiki/Sine_and_cosine\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nJonathan Katz pgconfeu2023 Vectors are the new JSON\u0026#160;\u0026#x21a9;\u0026#xfe0e;\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nhttps://en.wikipedia.org/wiki/Jaccard_index\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nVyacheslav Efimov Similarity Search, Part 5: Locality Sensitive Hashing (LSH)\u0026#160;\u0026#x21a9;\u0026#xfe0e;\u0026#160;\u0026#x21a9;\u0026#xfe0e;\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nhttps://en.wikipedia.org/wiki/Hamming_distance\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nVyacheslav Efimov Similarity Search, Part 6: Random Projections with LSH Forest ↩\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nearthwjl Delaunay Triangulation Study Notes\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nhttps://en.wikipedia.org/wiki/Delaunay_triangulation\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nhttps://en.wikipedia.org/wiki/Voronoi_diagram\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nhttps://en.wikipedia.org/wiki/Precision_and_recall\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nJianshu LSH (Locality Sensitive Hashing) Algorithm\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nHNSW Original Paper\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nVyacheslav Efimov Similarity Search, Part 4: Hierarchical Navigable Small World (HNSW)\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nhttps://www.pinecone.io/learn/series/faiss/vector-indexes/\u0026#160;\u0026#x21a9;\u0026#xfe0e;\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nVyacheslav Efimov Similarity Search, Part 1: kNN \u0026amp; Inverted File Index\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nVyacheslav Efimov Similarity Search, Part 2: Product 
Quantization\u0026#160;\u0026#x21a9;\u0026#xfe0e;\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nPQ Original Paper\u0026#160;\u0026#x21a9;\u0026#xfe0e;\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nPinecone Faiss Manual https://www.pinecone.io/learn/series/faiss/product-quantization/\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nDiskANN Original Paper\u0026#160;\u0026#x21a9;\u0026#xfe0e;\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nDiskANN, A Disk-based ANNS Solution with High Recall and High QPS on Billion-scale Dataset https://milvus.io/blog/2021-09-24-diskann.md\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/2024/08/12/vector-database-core-concepts/","section":"Posts","summary":"Vector Database Core Concepts # A Bit of History # The development history of LLM models, from Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond1:\nMany people only gradually learned about large models after the ChatGPT explosion, but in the years before that tipping point, the development of large models had already begun a war of the gods. Several institutions published many revolutionary papers — on the corporate side: Google, DeepMind, OpenAI, Meta, Microsoft; on the academic side: Stanford, Berkeley, CMU, Princeton, MIT2.\n","title":"Vector Database Core Concepts","type":"posts"},{"content":" VACUUM Truncate # TRUNCATE\u0026mdash;Specifies that VACUUM should attempt to truncate off any empty pages at the end of the table and allow the disk space for the truncated pages to be returned to the operating system. This is normally the desired behavior and is the default unless the vacuum_truncate option has been set to false for the table to be vacuumed. Setting this option to false may be useful to avoid ACCESS EXCLUSIVE lock on the table that the truncation requires. This option is ignored if the FULL option is used.\nAKA, the truncate option in VACUUM is enabled by default. 
It removes empty pages at the end of the table, acquiring an AccessExclusiveLock (level 8 lock) on the table during the operation.\nToday I found that in a certain environment, after deleting all data with DELETE FROM, neither autovacuum nor manual VACUUM reclaimed the space.\nReproducing the issue:\ncreate table lzl1(a int); insert into lzl1 select generate_series(1,1000) a; analyze lzl1; lzldb=# select relname,relpages,reltuples from pg_class where relname=\u0026#39;lzl1\u0026#39;; relname | relpages | reltuples ---------+----------+----------- lzl1 | 5 | 1000 relpages is 5, so the last page number is 4.\nlzldb=\u0026gt; select t_ctid,lp,case lp_flags when 0 then \u0026#39;0:LP_UNUSED\u0026#39; when 1 then \u0026#39;LP_NORMAL\u0026#39; when 2 then \u0026#39;LP_REDIRECT\u0026#39; when 3 then \u0026#39;LP_DEAD\u0026#39; end as lp_flags,t_xmin,t_xmax,t_field3 as t_cid, raw_flags, info.combined_flags,substring(t_data,0,40) from heap_page_items(get_raw_page(\u0026#39;lzl1\u0026#39;,4)) item,LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2) info order by lp; t_ctid | lp | lp_flags | t_xmin | t_xmax | t_cid | raw_flags | combined_flags | substring --------+----+-----------+--------+--------+-------+-----------------------------------------+----------------+------------ (4,1) | 1 | LP_NORMAL | 772 | 0 | 0 | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {} | \\x89030000 (4,2) | 2 | LP_NORMAL | 772 | 0 | 0 | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {} | \\x8a030000 ... 
delete from lzl1; vacuum lzl1; lzldb=\u0026gt; select t_ctid,lp,case lp_flags when 0 then \u0026#39;0:LP_UNUSED\u0026#39; when 1 then \u0026#39;LP_NORMAL\u0026#39; when 2 then \u0026#39;LP_REDIRECT\u0026#39; when 3 then \u0026#39;LP_DEAD\u0026#39; end as lp_flags,t_xmin,t_xmax,t_field3 as t_cid, raw_flags, info.combined_flags,substring(t_data,0,40) from heap_page_items(get_raw_page(\u0026#39;lzl1\u0026#39;,4)) item,LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2) info order by lp; t_ctid | lp | lp_flags | t_xmin | t_xmax | t_cid | raw_flags | combined_flags | substring --------+----+-------------+--------+--------+-------+-----------+----------------+----------- | 1 | 0:LP_UNUSED | | | | | | lzldb=# select relname,relpages,reltuples from pg_class where relname=\u0026#39;lzl1\u0026#39;; relname | relpages | reltuples ---------+----------+----------- lzl2 | 5 | 0 It looks like all dead tuples were reclaimed, but the space is still occupied — the pages were not freed. Why doesn\u0026rsquo;t it truncate when the table is completely empty? 
Let\u0026rsquo;s dig into this question.\nSource Code Analysis of should_attempt_truncation # (Unless otherwise noted, the version referenced is PG 11.)\nIn vacuumlazy.c there\u0026rsquo;s a pithily named function should_attempt_truncation, which decides whether truncation should be attempted:\nstatic bool should_attempt_truncation(LVRelStats *vacrelstats) { BlockNumber possibly_freeable; possibly_freeable = vacrelstats-\u0026gt;rel_pages - vacrelstats-\u0026gt;nonempty_pages; if (possibly_freeable \u0026gt; 0 \u0026amp;\u0026amp; (possibly_freeable \u0026gt;= REL_TRUNCATE_MINIMUM || possibly_freeable \u0026gt;= vacrelstats-\u0026gt;rel_pages / REL_TRUNCATE_FRACTION) \u0026amp;\u0026amp; old_snapshot_threshold \u0026lt; 0) return true; else return false; } Where:\n#define REL_TRUNCATE_MINIMUM 1000 #define REL_TRUNCATE_FRACTION 16 So truncation is attempted only when both of the following hold:\n(1) the number of empty trailing pages is \u0026gt;= 1000, or \u0026gt;= 1/16 of total pages; (2) old_snapshot_threshold \u0026lt; 0.\nThe first rule exists to avoid constantly truncating tiny runs of trailing empty pages; reclaiming that negligible space isn\u0026rsquo;t worth the time and the AccessExclusiveLock.\nThe second rule is explained as follows:\n* Also don\u0026#39;t attempt it if we are doing early pruning/vacuuming, because a * scan which cannot find a truncated heap page cannot determine that the * snapshot is too old to read that page. We might be able to get away with * truncating all except one of the pages, setting its LSN to (at least) the * maximum of the truncated range if we also treated an index leaf tuple * pointing to a missing heap page as something to trigger the \u0026#34;snapshot too * old\u0026#34; error, but that seems fragile and seems like it deserves its own patch * if we consider it. 
\u0026ldquo;Because VACUUM scanning cannot yet confirm whether page data has snapshot-too-old issues, and there are LSN and index page complications, the code logic looks fiddly. If this feature is needed, a dedicated patch would be required.\u0026rdquo;\nOK, so it looks like the code simply doesn\u0026rsquo;t check whether a page actually has snapshot-too-old issues. It takes the blunt approach of checking old_snapshot_threshold \u0026lt; 0 — the database itself must have snapshot-too-old disabled before truncation is attempted.\nGoing back to the earlier problem where VACUUM didn\u0026rsquo;t reclaim space: since DELETE removed all data, the condition \u0026ldquo;empty trailing pages \u0026gt; 1/16 of total pages\u0026rdquo; was definitely satisfied. However, old_snapshot_threshold was actually enabled in that environment:\nlzldb=\u0026gt; show old_snapshot_threshold ; old_snapshot_threshold ------------------------ 1h Disabling old_snapshot_threshold and then doing the delete-all + VACUUM will reclaim the space. Disabling old_snapshot_threshold requires a database restart.\n-- After restart lzldb=\u0026gt; show old_snapshot_threshold ; old_snapshot_threshold ------------------------ -1 lzldb=\u0026gt; select pg_relation_filepath(\u0026#39;lzl1\u0026#39;); pg_relation_filepath ---------------------- base/16384/16446 lzldb=\u0026gt; vacuum lzl1; -- Pages successfully reclaimed lzldb=# select relname,relpages,reltuples from pg_class where relname=\u0026#39;lzl1\u0026#39;; relname | relpages | reltuples ---------+----------+----------- lzl1 | 0 | 0 -- Table not rebuilt lzldb=\u0026gt; select pg_relation_filepath(\u0026#39;lzl1\u0026#39;); pg_relation_filepath ---------------------- base/16384/16446 All pages successfully reclaimed, table not rebuilt. 
Problem located.\nBut to understand the VACUUM truncation mechanism more deeply, let\u0026rsquo;s continue to the next section.\nSource Code Analysis of lazy_truncate_heap # Relying solely on should_attempt_truncation to judge truncation isn\u0026rsquo;t rigorous enough. We also need to look at lazy_truncate_heap, the function that actually performs truncation, which has additional checks:\n/* * lazy_truncate_heap - try to truncate off any empty pages at the end */ static void lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats) { BlockNumber old_rel_pages = vacrelstats-\u0026gt;rel_pages; BlockNumber new_rel_pages; int\tlock_retry; /* Report that we are now truncating */ pgstat_progress_update_param(PROGRESS_VACUUM_PHASE, PROGRESS_VACUUM_PHASE_TRUNCATE); /* * Loop until no more truncating can be done. */ do { PGRUsage\tru0; pg_rusage_init(\u0026amp;ru0); /* * We need full exclusive lock on the relation in order to do * truncation. If we can\u0026#39;t get it, give up rather than waiting --- we * don\u0026#39;t want to block other backends, and we don\u0026#39;t want to deadlock * (which is quite possible considering we already hold a lower-grade * lock). */ vacrelstats-\u0026gt;lock_waiter_detected = false; lock_retry = 0; while (true) { /* If we can acquire the lock, break out of the while loop */ if (ConditionalLockRelation(onerel, AccessExclusiveLock)) break; /* * Check for interrupts while trying to (re-)acquire the exclusive * lock. */ CHECK_FOR_INTERRUPTS(); /* Each failed attempt increments lock_retry; once it exceeds 100 (5000 / 50), give up truncation and return */ if (++lock_retry \u0026gt; (VACUUM_TRUNCATE_LOCK_TIMEOUT / VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL)) { /* * We failed to establish the lock in the specified number of * retries. This means we give up truncating. 
*/ vacrelstats-\u0026gt;lock_waiter_detected = true; ereport(elevel, (errmsg(\u0026#34;\\\u0026#34;%s\\\u0026#34;: stopping truncate due to conflicting lock request\u0026#34;, RelationGetRelationName(onerel)))); return; } /* Sleep 50 ms. Looks a bit crude. Theoretical max wait: 50 ms * 100 = 5 s */ pg_usleep(VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL * 1000L); } /* After acquiring the exclusive lock, check whether the relation grew during VACUUM. If so, don\u0026#39;t truncate. */ new_rel_pages = RelationGetNumberOfBlocks(onerel); if (new_rel_pages != old_rel_pages) { UnlockRelation(onerel, AccessExclusiveLock); return; } new_rel_pages = count_nondeletable_pages(onerel, vacrelstats); /* If new tuples were written during VACUUM, don\u0026#39;t truncate */ if (new_rel_pages \u0026gt;= old_rel_pages) { /* can\u0026#39;t do anything after all */ UnlockRelation(onerel, AccessExclusiveLock); return; } /* * Okay to truncate. */ RelationTruncate(onerel, new_rel_pages); /* Release lock immediately after truncation */ UnlockRelation(onerel, AccessExclusiveLock); ... } while (new_rel_pages \u0026gt; vacrelstats-\u0026gt;nonempty_pages \u0026amp;\u0026amp; vacrelstats-\u0026gt;lock_waiter_detected); } Where:\n#define VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL 50 /* ms */ #define VACUUM_TRUNCATE_LOCK_TIMEOUT 5000 /* ms */ The function that actually does the work is RelationTruncate. The bulk of the preceding code is all about trying to acquire the AccessExclusiveLock. Beyond the two conditions mentioned earlier, truncation also won\u0026rsquo;t happen in these two cases:\n(1) the AccessExclusiveLock could not be acquired; (2) new data was written during the VACUUM.\nVACUUM Truncate May Wait Up to 5 Seconds # While reading the lazy_truncate_heap source code above, I noticed the lock acquisition retry loop has a somewhat crude wait:\npg_usleep(VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL * 1000L); Each loop iteration sleeps 50ms. 
The theoretical maximum wait is 50 ms × 100 = 5 seconds!\nLet\u0026rsquo;s test this wait time:\nSetup: create table lzl2(a int); alter table lzl2 set (autovacuum_enabled=off); insert into lzl2 select generate_series(1,1000) a; delete from lzl2;\nWindow 1 (transaction left open): begin; select * from lzl2;\nWindow 2: \\timing vacuum lzl2; -- Time: 5022.122 ms (00:05.022)\nWe can see the wait time is about 5 seconds.\nIf you\u0026rsquo;re fast enough, you can open a third window and grab a pstack of session 2:\n[postgres@cncq081298 lzl]$ pstack 4113 #0 0x00002b92a978c013 in __select_nocancel () from /lib64/libc.so.6 #1 0x000000000086225a in pg_usleep (microsec=microsec@entry=50000) at pgsleep.c:56 #2 0x00000000005e8212 in lazy_truncate_heap (vacrelstats=0xfc4490, onerel=0x2b92a8bc88d8) at vacuumlazy.c:1861 #3 lazy_vacuum_rel (onerel=onerel@entry=0x2b92a8bc88d8, options=options@entry=5, params=params@entry=0x7ffc96bb31d0, bstrategy=\u0026lt;optimized out\u0026gt;) at vacuumlazy.c:290 #4 0x00000000005e4551 in vacuum_rel (relid=32778, relation=\u0026lt;optimized out\u0026gt;, options=options@entry=5, params=params@entry=0x7ffc96bb31d0) at vacuum.c:1572 #5 0x00000000005e55ac in vacuum (options=5, relations=0xfc6540, params=params@entry=0x7ffc96bb31d0, bstrategy=\u0026lt;optimized out\u0026gt;, bstrategy@entry=0x0, isTopLevel=isTopLevel@entry=true) at vacuum.c:340 ... The backend was inside pg_usleep, called from lazy_truncate_heap with microsec=50000 (50 ms). 
In reality, pg_usleep looped 100 times, total wait time 50000×100 microseconds = 5 seconds.\nLater, in PG 15, this code was improved by replacing pg_usleep with WaitLatch:\n(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL, WAIT_EVENT_VACUUM_TRUNCATE); ResetLatch(MyLatch); VACUUM Truncate Summary # Conditions for VACUUM to trigger truncation (all must be met):\nEmpty trailing pages \u0026gt; 1000, or empty trailing pages \u0026gt; 1/16 of total pages old_snapshot_threshold \u0026lt; 0 Before PG 15 (exclusive): must acquire AccessExclusiveLock within 5 seconds No new data written during the VACUUM This article was originally published in Chinese on lastdba.com.\n","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/2024/08/12/when-does-vacuum-truncate-empty-pages-at-the-end-of-a-table/","section":"Posts","summary":"VACUUM Truncate # TRUNCATE—Specifies that VACUUM should attempt to truncate off any empty pages at the end of the table and allow the disk space for the truncated pages to be returned to the operating system. This is normally the desired behavior and is the default unless the vacuum_truncate option has been set to false for the table to be vacuumed. Setting this option to false may be useful to avoid ACCESS EXCLUSIVE lock on the table that the truncation requires. This option is ignored if the FULL option is used.\n","title":"When Does VACUUM Truncate Empty Pages at the End of a Table?","type":"posts"},{"content":" Analyzing Slow CREATE TABLE.. 
PARTITION OF Statements # 2024-05-16 22:02:59.063 CST,\u0026#34;user1\u0026#34;,\u0026#34;dblzl\u0026#34;,125889,\u0026#34;30.88.79.3:37423\u0026#34;,66461213.1ebc1,2,\u0026#34;authentication\u0026#34;,2024-05-16 22:02:59 CST,34/41364668,0,LOG,00000,\u0026#34;connection authorized: user=user1 database=dblzl\u0026#34;,,,,,,,,,\u0026#34;\u0026#34;,\u0026#34;client backend\u0026#34; 2024-05-16 22:02:59.079 CST,\u0026#34;user1\u0026#34;,\u0026#34;dblzl\u0026#34;,125889,\u0026#34;30.88.79.3:37423\u0026#34;,66461213.1ebc1,3,\u0026#34;idle\u0026#34;,2024-05-16 22:02:59 CST,34/41364669,0,LOG,00000,\u0026#34;statement: -- a86fae372f73414bbe1af18213a47beb /*a86fae372f73414bbe1af18213a47beb */ create table if not exists table1_partition_p2406 partition of table1 for values from (\u0026#39;2024-06-01 00:00:00\u0026#39;) to (\u0026#39;2024-07-01 00:00:00\u0026#39;); \u0026#34;,,,,,,,,,\u0026#34;\u0026#34;,\u0026#34;client backend\u0026#34; ... 2024-05-16 22:38:28.555 CST,\u0026#34;user1\u0026#34;,\u0026#34;dblzl\u0026#34;,125889,\u0026#34;30.88.79.3:37423\u0026#34;,66461213.1ebc1,4,\u0026#34;CREATE TABLE\u0026#34;,2024-05-16 22:02:59 CST,34/0,0,LOG,00000,\u0026#34;duration: 2129483.549 ms\u0026#34;,,,,,,,,,\u0026#34;\u0026#34;,\u0026#34;client backend\u0026#34; The user \u0026lsquo;user1\u0026rsquo; connected to the database at 22:02:59 and immediately executed a create table.. partition of.. statement, which didn\u0026rsquo;t complete until 22:38:28. The logs in between are omitted — there was a lot of session blocking information, with session 125889 as the blocking source.\nBlocked sessions looked like:\nprocess 33569 still waiting for RowExclusiveLock on relation 53733 of database 17073 after 1000.048 ms\u0026#34;,\u0026#34;Process holding the lock: 125889. Wait queue: 33569. When PARTITION OF adds a partition, it acquires an AccessExclusiveLock (level 8) on the parent table, which blocks all operations on the partitioned table. 
Normally, adding a partition via PARTITION OF is very fast, and the lock is released immediately. However, if there\u0026rsquo;s a long-running transaction on the partitioned table, the level 8 lock on the parent table must wait, causing subsequent blocking.\n(Stolen from my own diagram): However, in this case there was no long transaction on the table, yet PARTITION OF took 35 minutes.\nFrom historical process information, this process was in D state (uninterruptible sleep), which was suspicious. Initially, I suspected memory or disk issues, but after investigation, everything was normal.\nHowever, this problem was easy to reproduce — running create table partition of directly in a simulation environment was very slow. pg_stat_activity showed the statement waiting on IO:\nwait_event_type | IO wait_event | DataFileRead state | active query | create table xxx partition of xx for values from (\u0026#39;2025-05-01 00:00:00\u0026#39;) to (\u0026#39;2025-06-01 00:00:00\u0026#39;); strace tracing revealed the process was heavily reading one file:\npread64(53, \u0026#34;\\22\\2\\0\\0\\220w\\321\u0026gt;\\0\\0\\5\\0\\24\\0018\\1\\0 \\4 \\0\\0\\0\\0\\200\\237\\0\\1\\310\\236p\\1\u0026#34;..., 8192, 863485952) = 8192 Using file descriptor 53, we identified the file:\n[/proc/356174/fd] ll |grep 53 lrwx------ 1 postgres postgres 64 May 17 15:34 53 -\u0026gt; /lzl/pglzl/data/base/17076/25883 oid2name -d lzldb -f 25883 From database \u0026#34;lzldb\u0026#34;: Filenode Table Name ----------------------------------------------- 25883 table_partition_default Finally located: the table table_partition_default:\n=\u0026gt; \\d+ table_partition_default ... 
Partition of: table_partition_default DEFAULT Partition constraint: (NOT ((date_created IS NOT NULL) AND ((date_created \u0026lt; \u0026#39;2022-05-01 00:00:00\u0026#39;::timestamp without time zone) OR ((date_created \u0026gt;= \u0026#39;2022-05-01 00:00:00\u0026#39;::timestamp without time zone) AND (da =\u0026gt; \\dt+ table_partition_default List of relations Schema | Name | Type | Owner | Persistence | Size | Description --------+------------------------------------+-------+------------+-------------+-------+------------- public | table_partition_default | table | user1 | permanent | 50 GB | (1 row) It was the default partition table, with tens of GB of data. Oracle DBAs might find this unfamiliar — PG\u0026rsquo;s default partition receives data that doesn\u0026rsquo;t fall into any defined partition range. The default partition ensures data is still accepted even if no matching range is defined.\nIf data exists in the default partition and a new partition needs to cover that range, what happens? It directly throws an error:\n=\u0026gt; create table if not exists table_partition_pxxxx partition of table_partition for values from (\u0026#39;2023-01-12 00:00:00\u0026#39;) to (\u0026#39;2023-01-13 00:00:00\u0026#39;); ERROR: 23514: updated partition constraint for default partition \u0026#34;table_partition_default\u0026#34; would be violated by some row SCHEMA NAME: public TABLE NAME: table_partition_default LOCATION: check_default_partition_contents, partbounds.c:3227 As you can see, when adding a child partition, the default partition\u0026rsquo;s partition constraint is automatically modified. 
The default partition constraint check is essentially validating the default partition\u0026rsquo;s data against the new partition\u0026rsquo;s range.\nAt this point, the cause is clear:\nWhen adding a new child partition to a partitioned table, the partition creation statement needs to validate data in the default partition to ensure the new partition\u0026rsquo;s data range doesn\u0026rsquo;t conflict with existing default partition data. This caused CREATE TABLE PARTITION OF to read a massive amount of default partition data, preventing the new partition from being created. The blocking then cascaded, making business data unqueryable and unwritable.\nSummary and Recommendations # PostgreSQL partitioned tables are becoming increasingly common. Maintaining partitions requires attention to many details. I recommend reading PostgreSQL Partitioned Tables, which covers almost everything.\nIn this case, the key to resolution is the data in the default partition. Before refactoring the default partition, do not use PARTITION OF to create child partitions.\nDefault partition refactoring plan:\nDetach the default child partition, then properly create child partitions, and reinsert the default table data back into the partitioned table. If necessary, after detaching and creating proper child partitions, create an empty default partition to maintain business data continuity. Note that detach differs from attach — detach requires a level 8 lock on the parent table. PG14 supports DETACH CONCURRENTLY. If you don\u0026rsquo;t refactor the default partition, check the current data range in the default partition. Using ATTACH to add child partitions will be slow, but won\u0026rsquo;t block reads and writes.\nFinally, a review of best practices for adding partitions:\nPARTITION OF requires a level 8 lock on the parent table, which carries risk. The recommended approach is to use ATTACH to add new child partitions (partition indexes can be handled similarly). 
This does not block reads and writes, has no business impact, and can be done online.\nThe correct approach for adding new partitions:\nCREATE TABLE lzlpartition1_202303 (LIKE lzlpartition1 INCLUDING DEFAULTS INCLUDING CONSTRAINTS); alter table LZLPARTITION1 attach partition LZLPARTITION1_202303 for values from (\u0026#39;2023-03-01 00:00:00\u0026#39;) to (\u0026#39;2023-04-01 00:00:00\u0026#39;); If the new partition already has data, ATTACH may still be slow. You can optimize by pre-creating constraints:\nThe correct approach for adding a partition that already has data:\n-- Reduce verbose DDL by using LIKE CREATE TABLE lzlpartition1_202303 (LIKE lzlpartition1 INCLUDING DEFAULTS INCLUDING CONSTRAINTS); -- Skip this step if no data exists. Add a CHECK constraint referencing other partitions\u0026#39; Partition constraint to reduce ATTACH constraint validation time. alter table lzlpartition1_202303 add constraint chk_202303 CHECK ((date_created IS NOT NULL) AND (date_created \u0026gt;= \u0026#39;2023-03-01 00:00:00\u0026#39;::timestamp without time zone) AND (date_created \u0026lt; \u0026#39;2023-04-01 00:00:00\u0026#39;::timestamp without time zone)); -- Add partition via ATTACH alter table LZLPARTITION1 attach partition LZLPARTITION1_202303 for values from (\u0026#39;2023-03-01 00:00:00\u0026#39;) to (\u0026#39;2023-04-01 00:00:00\u0026#39;); -- Optional. Before transactions occur on the new partition, drop the extra CHECK constraint alter table lzlpartition1_202303 drop constraint chk_202303; ","date":"Aug 12, 2024","externalUrl":null,"permalink":"/en/2024/08/12/why-is-partition-of-slow-when-theres-no-blocking/","section":"Posts","summary":"Analyzing Slow CREATE TABLE.. 
PARTITION OF Statements # 2024-05-16 22:02:59.063 CST,\"user1\",\"dblzl\",125889,\"30.88.79.3:37423\",66461213.1ebc1,2,\"authentication\",2024-05-16 22:02:59 CST,34/41364668,0,LOG,00000,\"connection authorized: user=user1 database=dblzl\",,,,,,,,,\"\",\"client backend\" 2024-05-16 22:02:59.079 CST,\"user1\",\"dblzl\",125889,\"30.88.79.3:37423\",66461213.1ebc1,3,\"idle\",2024-05-16 22:02:59 CST,34/41364669,0,LOG,00000,\"statement: -- a86fae372f73414bbe1af18213a47beb /*a86fae372f73414bbe1af18213a47beb */ create table if not exists table1_partition_p2406 partition of table1 for values from ('2024-06-01 00:00:00') to ('2024-07-01 00:00:00'); \",,,,,,,,,\"\",\"client backend\" ... 2024-05-16 22:38:28.555 CST,\"user1\",\"dblzl\",125889,\"30.88.79.3:37423\",66461213.1ebc1,4,\"CREATE TABLE\",2024-05-16 22:02:59 CST,34/0,0,LOG,00000,\"duration: 2129483.549 ms\",,,,,,,,,\"\",\"client backend\" The user ‘user1’ connected to the database at 22:02:59 and immediately executed a create table.. partition of.. statement, which didn’t complete until 22:38:28. The logs in between are omitted — there was a lot of session blocking information, with session 125889 as the blocking source.\n","title":"Why Is 'partition of' Slow When There's No Blocking?","type":"posts"},{"content":"​ I just finished watching the Yellowstone series and decided to write a bit about the American shows I\u0026rsquo;ve watched recently — eleven in total. Here\u0026rsquo;s a quick review of each.\nYellowstone # Yellowstone is already at its fifth season, and it looks like they\u0026rsquo;ll keep going. When I first started watching, I genuinely got hooked — a beautiful, grand series with stunning cinematography and gorgeous scenery. 
Plus, you get to see how real American ranchers herd cattle — actual ranchers really do have that old-money landowner vibe\u0026hellip;\nSeason one\u0026rsquo;s plot holds up fine — the dynamics between the Dutton family, the Native Americans, the state government, and the developers work well, and you can casually enjoy watching cowboys herd cattle along the way. But the plot in later seasons\u0026hellip; is unexpectedly bad. Downright incomprehensible. It lowers the bar for screenwriting.\nZooming into the show\u0026rsquo;s core: why do so many people love this series? Because Yellowstone doesn\u0026rsquo;t just depict authentic cowboy life (they even filmed some genuine ranch cowboy life later on) — it also reflects the harsh reality that old ranches can barely survive under modern societal development. And cowboy culture and private land are the very heart of American identity. It\u0026rsquo;s not just the Dutton family stubbornly trying to preserve the ranching way of life — it almost feels like a clash between urban American development and native cultural preservation.\nI can responsibly say: the plot definitely gets worse with each season — so bad that the main storyline becomes unwatchable. But if they release more seasons, this show will still be my top priority over everything else.\nPersonal rating: ⭐️⭐️⭐️⭐️⭐️\nRecommended: ⭐️⭐️⭐️\n1923 # A Yellowstone prequel series. Maybe because Yellowstone is so famous, this prequel 1923 ended up with a bit too many stars — it left a bad impression from the start. Perhaps the creators thought Yellowstone\u0026rsquo;s plot wasn\u0026rsquo;t good enough, and that a show purely about cowboys would be hard to craft a compelling script for, so they added two subplots to 1923. But adding subplots created another problem: the show doesn\u0026rsquo;t feel enough like Yellowstone. 
Constant cutting between storylines — no \u0026ldquo;slow-paced\u0026rdquo; Yellowstone vibe.\nThe Native American girl\u0026rsquo;s storyline seems completely disconnected from the main plot — no idea when it\u0026rsquo;ll tie in. But this Native girl subplot is actually pretty good. Native lands were stolen, and their children were sent to boarding schools to be forcibly indoctrinated with white Christian beliefs. This subplot genuinely carries the Yellowstone spirit. The Native characters are cold-blooded killers too — none of that \u0026ldquo;bullet in the body but still politicking\u0026rdquo; dissonance. The narrative flows smoothly without dragging; this subplot is quite watchable.\nAs for the Africa subplot\u0026hellip; while they do capture some scenery, it\u0026rsquo;s just not as good as the Dutton ranch — doesn\u0026rsquo;t have that feeling. And once they leave Africa, it starts dragging, heavily focusing on a grand romance set against the era\u0026rsquo;s backdrop — but what does that have to do with Yellowstone? And this storyline waited an entire season without converging into the main plot\u0026hellip; An entire season of setup for one character, framed as \u0026ldquo;the Dutton ranch\u0026rsquo;s hope rests on him\u0026rdquo; — the stakes are too high, and the subplot itself isn\u0026rsquo;t that compelling. Season two is highly likely to be a massive flop.\nThe early part of 1923 still had some ranch-versus-the-tide-of-history flavor. Later it\u0026rsquo;s pure padding — they don\u0026rsquo;t even film cattle herding anymore. Completely devoid of interest. Can\u0026rsquo;t even muster a decent fight. Kind of bad. 
Only eight episodes in the whole season, and the plot starts falling apart halfway through — didn\u0026rsquo;t learn anything from Yellowstone except how to botch the ending.\nYou can tell this show wanted to inherit Yellowstone but also try something new — depicting that era\u0026rsquo;s America and Europe (even colonial Africa) — but ended up being a mess of everything and nothing. If you want to revisit that era, I recommend Boardwalk Empire, which is set around the same time (Prohibition era) and has far more period atmosphere than this show.\nPersonal rating: ⭐️⭐️⭐️\nRecommended: ⭐️⭐️\n1883 # 1883 — a grand, tragic Western epic. A Yellowstone series entry, the prequel to the prequel. It feels like watching an epic saga, leaving you wanting more. It\u0026rsquo;s no longer just a simple TV show — the cinematography even has literary and artistic qualities, while also carrying a slice of American pioneering history. The U.S. had just emerged from the Civil War, everything was waiting to be rebuilt\u0026hellip;\nI personally really enjoy shows like Yellowstone — the filming style suits my taste. But the main series plot is aggressively terrible; I\u0026rsquo;d rather just watch them ride horses on the ranch and skip the main storyline entirely. 1883 fills that gap perfectly — not too much complex plot (but not too little either), just right. Look at the valley, look at the horses, add some epic BGM, and the immersion is strong.\nThe entire 1883 series doesn\u0026rsquo;t actually have much plot, but it tells a very complete story. America had just ended its Civil War, in an era of lawlessness — cowboys, bandits, sheriffs, European immigrants, Native Americans\u0026hellip; There\u0026rsquo;s some classic cowboy shootout action, but the focus is more on cowboy life and immigrants\u0026rsquo; yearning for freedom. Yet the road to freedom is full of hardship: horse thieves, Native tribes, rattlesnakes, tornadoes, and this unforgiving land. A deeply profound show. 
Other than the female lead\u0026rsquo;s runny nose being a minus, there\u0026rsquo;s nothing to criticize. The plot is that rare combination of complete and perfectly proportioned. Very, very highly recommended.\nHere\u0026rsquo;s a favorite line describing cowboys:\nPersonal rating: ⭐️⭐️⭐️⭐️⭐️\nRecommended: ⭐️⭐️⭐️⭐️⭐️\nTulsa King # A pure thrill ride. Starring 70-something Sylvester Stallone as an old-school mobster who\u0026rsquo;s been locked up for decades, now reasserting order over a small city\u0026rsquo;s underworld. \u0026ldquo;It\u0026rsquo;s not that I can\u0026rsquo;t adapt — it\u0026rsquo;s that people today have messed-up rules.\u0026rdquo; Us old-school gangsters follow a code~ The plot has no real flaws, no dragging — just pure entertainment. Not sure if they\u0026rsquo;ll keep making more.\nPersonal rating: ⭐️⭐️⭐️⭐️\nRecommended: ⭐️⭐️⭐️⭐️⭐️\nWednesday # A fun watch, pretty decent. I\u0026rsquo;d never seen a gothic Lolita-style American show before, and it looks pretty good. The early parts are quite engaging and fresh. Later, when it leans into mystery, it falls off — everyone can tell who\u0026rsquo;s behind it, except Wednesday (the main character)\u0026hellip; (A lot of American mystery shows are like this — start strong, then gradually fall apart.) If you\u0026rsquo;ve never tried the gothic Lolita style, give it a shot.\nPersonal rating: ⭐️⭐️⭐️⭐️\nRecommended: ⭐️⭐️⭐️\nThe Last of Us # Adapted from the video game of the same name — which I somehow never played! Precisely because I hadn\u0026rsquo;t played it, I could watch the show with a calm mind. Starring the hugely popular Lyanna Mormont (Bella Ramsey) and Oberyn Martell (Pedro Pascal) from Game of Thrones — both deliver smooth, natural performances. It\u0026rsquo;s a post-apocalyptic zombie-type show, but the zombies aren\u0026rsquo;t from a virus — they\u0026rsquo;re from a fungal infection. The zombies\u0026rsquo; brains are full of fungus. 
One memorable scene: Bella\u0026rsquo;s character cuts open the head of a zombie stuck between rocks, and the fungus inside spills out — still alive. Maybe because of the fungus element, it\u0026rsquo;s more satisfying than the average zombie show. The visuals are great — not dark and murky, and not overly disgusting. A complete, well-told story with excellent cinematography. There\u0026rsquo;s one segment near the end that personally left me with some psychological discomfort, but overall the plot absolutely holds up. Several smaller storylines are beautifully told. Very good overall, highly recommended.\nPersonal rating: ⭐️⭐️⭐️⭐️\nRecommended: ⭐️⭐️⭐️⭐️⭐️\nBoardwalk Empire # A series spanning four seasons, now complete. Set in 1920s America, right after Prohibition was enacted. Women stood outside bars calling for their rights; politicians publicly supported women while privately running bootlegging operations; gangsters stepped out of cars in trench coats, Thompson submachine guns blazing\u0026hellip; Boardwalk Empire is about the gangster empire of Atlantic City (just below New York), built on bootlegging into a wealth rivaling nations. I imagine many have seen Once Upon a Time in America — you can roughly think of this show as its TV series counterpart. This one is hard to summarize — let\u0026rsquo;s go season by season.\nSeason one is god-tier. Plenty of risqué scenes, and the plot isn\u0026rsquo;t just smooth — it\u0026rsquo;s miraculous. Women, black communities, bootlegging, jazz, gang wars, WWI veterans\u0026hellip; Gangsters have essentially seized control of the city — even the newspapers don\u0026rsquo;t care what the mayor says.\nSeason two is a direct continuation of season one — also excellent.\nSeason three introduces problems. It doesn\u0026rsquo;t feel like a continuation of the first two seasons (though some plot threads connect) — it could almost stand alone. Is the plot bad? Yes, it\u0026rsquo;s disconnected. But is it terrible? 
Taken on its own, it\u0026rsquo;s not flawed — it\u0026rsquo;s even somewhat entertaining. This season has many brilliant segments: Half-Face taking on ten men alone, the jaw-dropping plotline of the formidable madam, extended solo blues performances by black characters — all superb!\nSeason four is full of issues. I thought my favorite character, dormant for three seasons, would finally take center stage and do something meaningful — instead, he was hastily written off. Dear writers, if that\u0026rsquo;s how it was going to be, could you not have put him on the poster? Made it seem like something big was coming — got my hopes up for nothing\u0026hellip; Season four\u0026rsquo;s protagonist has risen too high, making it hard to drive the plot (you could already feel this in season three). The only highlight of season four is the protagonist\u0026rsquo;s childhood flashbacks — a perfect closure to his arc.\nMany characters\u0026rsquo; later arcs are unsatisfying, but many characters\u0026rsquo; mid-series arcs are just too brilliant\u0026hellip; Although this show isn\u0026rsquo;t hugely popular, it did win awards, and you can see many scenes being referenced by later, higher-profile American shows. For example, Gus Fring\u0026rsquo;s arc in Breaking Bad borrows from Half-Face; King Tommen\u0026rsquo;s suicide in Game of Thrones \u0026ldquo;completely\u0026rdquo; borrows from the butler\u0026rsquo;s suicide\u0026hellip;\nI really love this show — it immerses you in the glamorous cities of that era, the decadent urban life, the jazz of underground speakeasies, the gangsters\u0026hellip; A narrative that holds nothing back (I mean that about everything). The series as a whole is excellent, rich with period atmosphere.\nPersonal rating: ⭐️⭐️⭐️⭐️⭐️\nRecommended: ⭐️⭐️⭐️⭐️\nBand of Brothers # I\u0026rsquo;m sure many have heard of this show\u0026rsquo;s reputation. Yes — I somehow hadn\u0026rsquo;t seen it. 
My elementary-school-level writing ability and limited education prevent me from offering any meaningful critique. Only one word can describe it: divine. I\u0026rsquo;ll find a chance to watch it again~\nPersonal rating: ⭐️⭐️⭐️⭐️⭐️\nRecommended: ⭐️⭐️⭐️⭐️⭐️\nThe Pacific # The Pacific was made about nine years after Band of Brothers (2010 versus 2001). It\u0026rsquo;s actually a very good show, but it has always stood in that monster\u0026rsquo;s shadow, and its reputation never reached the same heights. Band of Brothers covers the European theater of WWII; this show covers the Pacific theater. Strangely, the two shows mirror their respective theaters: the European theater is far better known, and the shows follow suit\u0026hellip; Even within the show, at the same dinner table, a European theater soldier shows off a captured Nazi banner while the Pacific theater soldier has nothing to show, which adds a touch of melancholy.\nThough not as famous, this is a very, very highly recommended show.\nPersonal rating: ⭐️⭐️⭐️⭐️⭐️\nRecommended: ⭐️⭐️⭐️⭐️⭐️\nThe Mandalorian # The Mandalorian is already in its third season. Also starring Pedro Pascal, also a dad-with-kid storyline\u0026hellip; The first two seasons were quite good and fairly popular. This third season? Not so much. The Mandalorian should probably remain a ronin-like figure driving the plot forward; a whole group of Mandalorians building a homeland just doesn\u0026rsquo;t feel right. The protagonist\u0026rsquo;s identity even gets a bit diluted. (Run, man! Take the kid and adventure across the galaxy. Isn\u0026rsquo;t that better?)\nMy appreciation for this show is premised on liking the Star Wars universe. In China, Star Wars fans are genuinely rare. If you\u0026rsquo;re not into it, you probably won\u0026rsquo;t get through it, so feel free to skip.\nPersonal rating: ⭐️⭐️⭐️⭐️⭐️\nRecommended: ⭐️\nThe White Tower (Shiroi Kyotō) # This one is a Japanese drama. I want to end with it, because it truly is exceptional, near perfect. 
Though it\u0026rsquo;s somewhat old, it never feels boring while watching. Many ideas are surprisingly forward-thinking, the plot rises and falls dramatically, good and evil are never absolute, and several female characters are beautifully drawn. You\u0026rsquo;ll see some classic love triangles and plot twists, and revisiting them is still quite rewarding. Professor Zaizen\u0026rsquo;s final act brings the entire series to a perfect close. Japanese drama: number one!\n\nPersonal rating: ⭐️⭐️⭐️⭐️⭐️\nRecommended: ⭐️⭐️⭐️⭐️⭐️\nClosing # All of these are worth watching, and many are masterpieces. For some shows I couldn\u0026rsquo;t find subtitled versions, so I watched them without subtitles, like the Yellowstone prequel 1883. Since the dialogue wasn\u0026rsquo;t overly complex, I managed to get through it (the narration is quite sophisticated)\u0026hellip; That makes it my first show watched without subtitles.\nThese are basically all the shows I\u0026rsquo;ve watched in the last half year or so, so I\u0026rsquo;m bundling them together. There are many other brilliant shows from earlier that left a deep impression; I\u0026rsquo;ll save those for another time when I\u0026rsquo;m in the mood~\nHoping to find more good shows in the second half of the year.\n\n","date":"Jun 1, 2023","externalUrl":null,"permalink":"/en/2023/06/01/chatting-about-american-tv-shows-june-2023/","section":"Posts","summary":"I just finished watching the Yellowstone series and decided to write a bit about the American shows I’ve watched recently, eleven in total. Here’s a quick review of each.\nYellowstone # Yellowstone is already in its fifth season, and it looks like they’ll keep going. When I first started watching, I genuinely got hooked: a beautiful, grand series with stunning cinematography and gorgeous scenery. 
Plus, you get to see how real American ranchers herd cattle — actual ranchers really do have that old-money landowner vibe…\n","title":"Chatting About American TV Shows — June 2023","type":"posts"},{"content":" The Last DBA # Hi, I\u0026rsquo;m Zhilong Liu — a PostgreSQL DBA based in China.\nThis blog is where I document my deep dives into PostgreSQL internals, production incident analysis, source code walkthroughs, and paper reviews. I write primarily in Chinese on lastdba.com, and I\u0026rsquo;m building this English section to share key insights with the global PostgreSQL community.\nWhat I Write About # Case Studies — Real production incidents and how they were resolved Internals — PostgreSQL mechanisms explained from first principles Source Code — Deep dives into specific subsystems (vacuum, locking, WAL, planner) Paper Reviews — Academic papers on databases, interpreted for practitioners AI \u0026amp; Databases — AIOps, MCP, and the intersection of AI with database operations Contact # GitHub: liuzhilong62 X (Twitter): @liuzhilong62 Email: liuzhilong62@outlook.com All content is licensed under CC BY-NC-SA 4.0.\n","externalUrl":null,"permalink":"/en/about/","section":"The Last DBA","summary":"The Last DBA # Hi, I’m Zhilong Liu — a PostgreSQL DBA based in China.\nThis blog is where I document my deep dives into PostgreSQL internals, production incident analysis, source code walkthroughs, and paper reviews. 
I write primarily in Chinese on lastdba.com, and I’m building this English section to share key insights with the global PostgreSQL community.\nWhat I Write About # Case Studies — Real production incidents and how they were resolved Internals — PostgreSQL mechanisms explained from first principles Source Code — Deep dives into specific subsystems (vacuum, locking, WAL, planner) Paper Reviews — Academic papers on databases, interpreted for practitioners AI \u0026 Databases — AIOps, MCP, and the intersection of AI with database operations Contact # GitHub: liuzhilong62 X (Twitter): @liuzhilong62 Email: liuzhilong62@outlook.com All content is licensed under CC BY-NC-SA 4.0.\n","title":"About","type":"page"},{"content":"","externalUrl":null,"permalink":"/en/series/","section":"Series","summary":"","title":"Series","type":"series"}]