Paper Deep Reads on Last DBA

Paper Deep Read: Anarchy in the Database

Sat, 03 Jan 2026 00:00:00 +0000

Paper: Anarchy in the Database: A Survey and Evaluation of Database Management System Extensibility

GitHub: https://github.com/cmu-db/ext-analyzer

PGConf: The trouble with extensions (PGConf.dev 2025)

Why This Paper
#

This is a survey of database extensions (mainly Postgres), covering the implementation approaches of extensions across different databases, existing problems, and most importantly, compatibility. The most significant finding: an evaluation of over 400 PostgreSQL extensions shows that 16.8% of extensions have compatibility issues with at least one other extension, potentially leading to system failures.

Analysis tools and results are on GitHub; Marco Slot’s presentation is at PGConf.

Extension Categories
#

Extension Classification
#

The extension classification chapter is particularly lengthy — a single diagram actually clarifies everything.

Extensions across 6 databases:

PostgreSQL (1986): Written in C, designed from the beginning as an extensible architecture. Consequently, PostgreSQL has the richest and most diverse extension ecosystem.
MySQL (1994): Written in C++, best known for its storage engine plugin architecture.
MariaDB (2009): A fork of MySQL, also C++ based, supporting more extensions than the original MySQL.
SQLite (2000): Embedded database written in C, adaptable to various hardware devices and operating systems.
Redis (2009): In-memory key-value store written in C. Its extensibility is unusual: extensions can only run above the DBMS key-value storage layer.
DuckDB (2018): Embedded analytical database written in C++, with a rapidly emerging extension ecosystem.

Flexibility and Security
#

Extension security and flexibility are a trade-off — PG extensions are the most flexible but least secure; Redis is the most secure but least flexible:

How PostgreSQL Extensions Are Typically Implemented
#

PG generally has two ways to implement extensions:

Through handler functions, such as UDFs, UDTs, external tables, storage engines, and index access methods.
Through hooks. Hooks are function pointers declared as global variables; when an extension sets a hook, PostgreSQL calls the function it points to instead of its built-in code path.

Implementations may use both approaches — they’re not mutually exclusive. The other 5 databases have generally similar implementations, but none of them have hook-based implementations.
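To make the hook mechanism concrete, here is a minimal Python sketch of the function-pointer pattern. It is a toy model, not PostgreSQL code: `planner_hook`, `standard_planner`, `install`, and the extension names are all hypothetical stand-ins, and I am assuming the common convention that each extension saves the previous hook value and calls it.

```python
# Toy model of a hook: a module-level "function pointer" that the core checks
# before falling back to its built-in code. All names here are hypothetical.
from typing import Callable, List, Optional

planner_hook: Optional[Callable[[str], List[str]]] = None


def standard_planner(query: str) -> List[str]:
    """Stand-in for the built-in code path."""
    return ["standard_planner"]


def plan(query: str) -> List[str]:
    # The core calls the hook if one is installed, otherwise its own code.
    if planner_hook is not None:
        return planner_hook(query)
    return standard_planner(query)


def install(name: str) -> None:
    """Install an extension hook that chains to whatever was there before."""
    global planner_hook
    prev = planner_hook  # remember the previously installed hook, if any

    def hook(query: str) -> List[str]:
        inner = prev(query) if prev is not None else standard_planner(query)
        return [name] + inner

    planner_hook = hook


install("ext_a")
install("ext_b")
call_order = plan("SELECT 1")
print(call_order)  # ['ext_b', 'ext_a', 'standard_planner']
```

The hook installed last runs first, so load order decides who wraps whom. In this model, an extension that forgets to call `prev` would silently cut every earlier extension out of the chain.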

An extension may combine several implementation approaches, e.g., function + type + index AM; the count of distinct approaches it uses is its number of extensibility types. From Figure 1, we can see that extensions with 1-3 types are the most common, and the most-used implementation approach is function.

From Table 3, 92.5% of extensions use UDFs — after all, it’s a user-facing feature, easiest to develop with the lowest barrier to entry. The least used is client authentication, as this scenario itself is uncommon.

Extension Code Copy Rate
#

The paper also conducted an interesting survey: the extent to which extension code is copied from built-in code:

Out of 441 extensions, 16.6% — 73 extensions — contain at least one line copied from PG source code. The detailed distribution is shown in the left chart above.

Why are so many extensions copying code? Because:

Some functions in PG source are declared static, only callable within their own file, so they can only be copied.
Due to the extension’s own requirements, functions may need slight adjustments, so they can only be copied and adjusted.

And how much were these copied functions adjusted? See the right chart above.

As can be seen, unmodified copies are actually rare.

In summary, extension code is copied from PG source out of necessity, and the overall copy rate isn’t high.

The Heavyweight! — PG Extension Compatibility
#

This is the most interesting part of the paper: pairwise compatibility testing was conducted on 96 extensions, and testing found that 16.8% of extension pairs are incompatible!

Testing methodology:

Installation. Yes, installation alone can cause problems. The authors tested both A→B and B→A installation orders, hence the asymmetric diagram.
Running the extension’s provided unit tests.
pgbench. Smoke testing. pgbench is of course simple, but good results here can still indicate something.
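The both-orders idea can be sketched in a few lines of Python. This is only an illustration of how such a test matrix is enumerated, not the ext-analyzer implementation: `toy_check` is a hypothetical stand-in for "install both, run the unit tests, run pgbench", and the extension names are made up.

```python
# Enumerate every ordered pair (A installed before B, and B before A) and
# record a pass/fail result per pair. The check function is a hypothetical stub.
from itertools import permutations
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]


def ordered_pairs(extensions: List[str]) -> List[Pair]:
    """All (first_installed, second_installed) combinations."""
    return list(permutations(extensions, 2))


def run_matrix(extensions: List[str],
               check: Callable[[str, str], bool]) -> Dict[Pair, bool]:
    return {(a, b): check(a, b) for a, b in ordered_pairs(extensions)}


def toy_check(first: str, second: str) -> bool:
    # Assumption for the demo: ext_b breaks only when installed after ext_a.
    return not (first == "ext_a" and second == "ext_b")


results = run_matrix(["ext_a", "ext_b", "ext_c"], toy_check)
failures = [pair for pair, ok in results.items() if not ok]
print(failures)  # [('ext_a', 'ext_b')]

# With 96 extensions, testing every pair in both orders means 96 * 95 runs.
n_pairs_96 = len(ordered_pairs([f"ext_{i}" for i in range(96)]))
print(n_pairs_96)  # 9120
```

Because `('ext_a', 'ext_b')` fails while `('ext_b', 'ext_a')` passes, the result matrix is asymmetric, which is the same effect the asymmetric diagram shows.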

Among the top 20 least compatible extensions, many commonly-used ones appear:

Common extensions: pg_hint_plan, vector, pg_show_plans, pgsentinel, pg_cron, pg_stat_kcache
Heavy extensions: citus, timescaledb

The fact that such extremely common and star extensions can have such poor compatibility is jaw-dropping.

What’s even more chilling: this is just simple pairwise testing. Running 3-10 extensions should be the production norm, and production environments are far more complex and variable than the paper’s three testing methods.

Finally, the paper identifies the reason for poor extension compatibility: extensions that use more components, extension types, and hooks are more likely to be incompatible with other extensions.

Nitpicking
#

It’s really still about Postgres

The paper’s title says DBMS, but it’s mainly about PG compatibility. MySQL, Redis, etc. compatibility is only covered in the survey, with no experimental data at all. (Though the survey is interesting — you can learn how MySQL and Redis extensions are implemented.)

On the other hand, this paper has a kind of alternative “general-specific-general” feel: “DBMS-Postgres-DBMS” 😅

Insufficient compatibility testing

PG has 400+ extensions, but only 96 were tested for compatibility, and only 1-on-1 compatibility testing, without tests involving 3 or more extensions. The compatibility testing isn’t particularly comprehensive.

Conclusion
#

PG extensions are indeed numerous and flexible — you’d struggle to find functionality that PG extensions don’t support. But the extensions themselves are almost in a state of “anarchy” — both extension development and usage have problems.

From the compatibility results, extension compatibility is quite poor: even the installation order affects it. Extensions that share a hook also depend on hook execution order; for example, if two extensions each require their hook to run last, only one of them can get its way. “Having everything” doesn’t mean “install everything.”

Extension Security Issues
#

PG extensions have virtually no security management, whether from inherently unsafe extensions or user privilege escalation through extensions.

If an extension contains unsafe languages, only the OS can restrict its behavior, not the DBMS.
If an extension can access user space, the OS layer cannot manage it.
Extensions implemented through queries (e.g., UDFs) generally won’t bypass ACL policies. While UDFs are more secure, they’re not absolutely secure, as UDFs with admin privileges can exist.
A single hook may not be restricted by ACL, because in PostgreSQL, ACL is only enforced at the planning and execution layers. PG provides SECURITY LABEL to restrict access control for objects (including extensions).

Philosophical Thoughts on Software Management
#

“If an extension contains unsafe languages, only the OS can restrict its behavior, not the DBMS.”

This statement itself isn’t wrong, but it carries an implication of “your directory could be deleted.” To counter this, consider the following:

If you use this software, you trust it, just like PG itself (but even when using PG, you create a postgres OS user rather than using root directly). As for extensions, treat them as part of the PG software. PG is trusted and can be installed directly in production because of its industry reputation. The same goes for extensions — choose reputable extensions rather than using them indiscriminately. This is essentially the difference between PostgreSQL community gatekeeping and extension provider gatekeeping. For cloud service providers, many extensions aren’t supported — the cloud provider assumes the gatekeeping function and the responsibility of taking the blame.

Version Convergence
#

PG extension versions have these characteristics:

The same extension may have different extension packages for different database versions.
Extensions have different versions.

This means that without version management, you’ll end up with unmanageable numbers of software versions. To address this, limiting specific PG versions to installing specific extension versions is a good approach. As for extension upgrades needed for certain requirements, implement them through PG version upgrades. This strategy sacrifices some flexibility to ensure stability. I personally think it’s worthwhile — the need to upgrade extensions itself isn’t common, but it can reduce many software management issues and unknown compatibility problems.

Consider Compatibility When Using Extensions
#

Since extension compatibility isn’t great, managing extensions becomes especially important — we don’t want the database returning strange results or even crashing while running.

Extension management strategy: 1. Install necessary extensions. 2. Create needed extensions on demand. 3. Don’t install obscure extensions.
Search the compatibility matrix. While PG compatibility testing isn’t perfect, it’s still valuable. Since the paper isn’t directly searchable for the compatibility matrix, you can “ctrl+f” search the ext-analyzer compatibility table to preliminarily assess whether extensions you need have good compatibility.

Trivia
#

In the 1976 INGRES paper, UDFs were already implemented through extensions. Even POSTGRES carried forward this functionality in its 1986 initial release. Oracle’s UDF implementation came in Oracle 7, released in 1992 — much later than PG.

The SQL standard didn’t include UDFs until 1996 — a full 20 years after INGRES’s UDF. Stonebraker indeed wasn’t very focused on driving standards.

Original link: https://lastdba.com/2026/01/03/论文精读插件无政府状态/

Paper Deep Read: DBAIOps

Sun, 21 Dec 2025 00:00:00 +0000

Paper: DBAIOps: A Reasoning LLM-Enhanced Database Operation and Maintenance System using Knowledge Graphs

Repo: https://github.com/weAIDB/DBAIOps/

What is DBAIOps
#

Why DBAIOps:

Manual operations are extremely time-consuming.
Manual operations are difficult to scale.
Manual operations are often trapped in recurring failures.
Documentation + RAG models are inaccurate (limited DBA experience integration).

In short, both manual operations and existing solutions are mediocre, hence DBAIOps — an operations system combining LLM reasoning and knowledge graphs to achieve DBA-like diagnostic capabilities.

Comparison of database failure analysis approaches:

Rule-based approach: Traditional, rigid.
Machine learning approach: Essentially rule-based with similar limitations; depends on training data, which limits generalization; generally suitable for diagnosing common, specific problems.
LLM-based approach: Uses general documentation and LLMs (e.g., decision-tree-based), prone to giving generic results.
LLM+RAG approach: Searches based on chunked top-k approximate knowledge; results are inaccurate.

After comparing the above approaches, the advantages of DBAIOps combining graph knowledge, DBA experience, and LLMs are clear:

Incorporates DBA experience.
Preserves original relationships.
Supports new root cause identification and solutions.
Extensible.

Overview
#

Left side is architecture, right side is an example.

Offline: DBA experience is embedded into Neo4j, with the resulting graph model called ExperienceGraph, where edges represent anomaly phenomena or metric relationships. The embedded anomaly model is called AnomalyModel.

Online: Anomaly analysis, retrieval, and report generation. The AnomalyProcessor extracts standard failure information and AnomalyModel information, then retrieves the graph via ExperienceRetriever; finally, RootCauseAnalyzer calls the LLM to generate analysis reports.

From the right-side example, we can see graph relevance finding LOG FILE SYNC associated with LOG WRITE performance and IO performance; through REDO ALLOCATION, we can find table structure changes and DDL.

The Operations Experience Graph Model
#

Unlike rule-based or document-chunk-based RAG, ExperienceGraph is a graph model encoding heterogeneous operations experience information. The graph contains three elements: (vertices, directed edges, relationships on edges).

Based on the characteristics of operations experience, DBAIOps classifies vertices:

trigger vertex: Used to detect database anomalies; the entry point for anomaly analysis. For example, LOG FILE SYNC is an entry vertex.
metric vertex: Database runtime metrics. For offline knowledge, this refers to metrics from operations case studies (if present).
experience vertex: Encodes domain-specific operations experience, covering anomaly meanings and handling methods. For example, LOG FILE SYNC exceeding 60ms indicates overly frequent commits or parameter adjustments needed.
tool vertex: Executable scripts for collecting and analyzing anomaly metrics.
tag vertex: Semantic categories of graph vertices. For example, “Concurrent Transactions” involves multiple vertex types; tag vertices strengthen cross-case associations.
auxiliary vertex: Explains the meaning of metrics.

Edge classification:

containment edge: Trigger Vertex - Experience Vertex
relevance edge: Trigger Vertex - Metric Vertex
diagnosis edge: Experience Vertex - Metric Vertex
synonym edge: Only appears between Tag Vertices, indicating semantic synonymy, e.g., physical_read and disk_read; shared_pool and shared_buffer.
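The vertex and edge taxonomy above maps naturally onto a small typed graph. The sketch below is my own illustration, not the DBAIOps schema: the class names, the edge direction (first endpoint to second), and the sample vertices are assumptions; only the six vertex types and four edge types come from the paper summary.

```python
# A typed graph that only accepts the edge/endpoint combinations listed above.
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Tuple


class VertexType(Enum):
    TRIGGER = "trigger"
    METRIC = "metric"
    EXPERIENCE = "experience"
    TOOL = "tool"
    TAG = "tag"
    AUXILIARY = "auxiliary"


class EdgeType(Enum):
    CONTAINMENT = "containment"
    RELEVANCE = "relevance"
    DIAGNOSIS = "diagnosis"
    SYNONYM = "synonym"


# Allowed (source, target) vertex types per edge type; direction is assumed.
ALLOWED = {
    EdgeType.CONTAINMENT: (VertexType.TRIGGER, VertexType.EXPERIENCE),
    EdgeType.RELEVANCE: (VertexType.TRIGGER, VertexType.METRIC),
    EdgeType.DIAGNOSIS: (VertexType.EXPERIENCE, VertexType.METRIC),
    EdgeType.SYNONYM: (VertexType.TAG, VertexType.TAG),
}


@dataclass
class ExperienceGraph:
    vertices: Dict[str, VertexType] = field(default_factory=dict)
    edges: List[Tuple[str, str, EdgeType]] = field(default_factory=list)

    def add_vertex(self, name: str, vtype: VertexType) -> None:
        self.vertices[name] = vtype

    def add_edge(self, src: str, dst: str, etype: EdgeType) -> None:
        endpoints = (self.vertices[src], self.vertices[dst])
        if endpoints != ALLOWED[etype]:
            raise ValueError(f"{etype.value} edge cannot connect {endpoints}")
        self.edges.append((src, dst, etype))


g = ExperienceGraph()
g.add_vertex("LOG FILE SYNC", VertexType.TRIGGER)
g.add_vertex("commits too frequent", VertexType.EXPERIENCE)  # hypothetical
g.add_vertex("log_file_sync_ms", VertexType.METRIC)          # hypothetical
g.add_vertex("physical_read", VertexType.TAG)
g.add_vertex("disk_read", VertexType.TAG)
g.add_edge("LOG FILE SYNC", "commits too frequent", EdgeType.CONTAINMENT)
g.add_edge("LOG FILE SYNC", "log_file_sync_ms", EdgeType.RELEVANCE)
g.add_edge("commits too frequent", "log_file_sync_ms", EdgeType.DIAGNOSIS)
g.add_edge("physical_read", "disk_read", EdgeType.SYNONYM)
```

Encoding the allowed endpoints as data makes gaps easy to spot: a tag-to-tool or tag-to-metric edge is rejected here simply because the listed taxonomy never defines one.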

Analyzing the operations experience graph model through an example:

LOG FILE SYNC has multiple TAGs, and TAGs are associated with Experience, metrics, and tools. The strong relevance is evident — it represents a human DBA’s understanding and operations experience of LOG FILE SYNC.

Graph Construction
#

Manual graph construction is unreliable, and existing ML-generated graphs may generate irrelevant relationships, so a semi-automatic graph generation approach is proposed.

Graph initialization: This part is manually generated, defining trigger vertices according to rules. Once trigger vertices are generated, their associated metric vertices, experience vertices, etc., are automatically generated. This is somewhat like a human DBA guiding the creation of a knowledge sketch — the overall framework cannot be changed; nothing bizarre should be generated.
Graph storage: Stored in Neo4j. Additionally, different database types are marked with tags, making much knowledge reusable and avoiding duplicate graph construction.
Graph augmentation: Generating more edges.
Graph updates: DBAIOps supports incremental updates. Updates here include both adding new vertices and removing old vertices.

Anomaly Model
#

Metrics
#

Metrics come from many sources, including runtime information (CPU %, throughput, etc., routine monitoring), logs, traces, etc. Combined with relevance differences, strongly correlated metrics need to be extracted. So metrics are divided into 2 categories:

Immediately collected metrics: Runtime information, logs, traces.
Subsequently collected metrics: Periodic, delta, etc., metrics generated when needed, such as AWR/ASH data.

Regarding metric-anomaly correlation, unlike baseline-based approaches, DBAIOps uses specific metric combinations for each anomaly type.

Finally, a formula determines whether an anomaly has actually occurred:

Two-Stage Graph Evolution
#

Database anomalies rarely occur in isolation — one performance issue may simultaneously trigger or exacerbate others. However, connections between different anomaly models (e.g., LOG_FILE_SYNC and REDO_ALLOCATION) in pre-built knowledge graphs tend to be loose, with shared experience fragments sparse and fragmented. This makes it difficult for traditional methods to discover cross-model composite root causes, such as combined I/O bottleneck and memory pressure issues.

To address this challenge, DBAIOps proposes an automatic “graph evolution” mechanism that dynamically discovers and connects relevant experience fragments between different anomaly models, evolving the knowledge graph from an initially sparse structure into a densely interconnected network, thus supporting more comprehensive root cause analysis.

Stage 1 - Graph Inference and Proximity Discovery: Uses graph query language (Cypher) to collect and aggregate relevant metrics, traversing related nodes and edges based on configurable thresholds to build association networks. For example, starting from LOG_FILE_SYNC latency, traverse up to 3 hops of associated nodes. Establish connections between LOG_FILE_SYNC and REDO_ALLOCATION models because they are both related to I/O-related concurrency issues. Through multiple iterations, the knowledge graph gradually evolves into a denser structure, enabling diagnosis to consider more potential factors and composite causes.
Stage 2 - Adaptive Abnormal Metric Detection: Identifies truly anomalous metrics along graph expansion paths. Using an Adaptive Detection Function (ADF), it calculates composite anomaly scores considering dimensions such as metric volatility and dynamic baseline deviation. Based on anomaly scoring results, it decides whether further knowledge graph structure expansion is needed, filtering a precise subset of anomaly metrics for subsequent LLM root cause reasoning.
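A rough Python sketch of the two stages, under loud assumptions: the paper uses Cypher on Neo4j and its own ADF formula, neither of which is reproduced here. `within_hops` is a plain breadth-first search with a hop limit, and `deviation_score` is a simple baseline-deviation measure I substituted for the ADF; the graph, metric names, and numbers are invented.

```python
# Stage 1 stand-in: collect nodes within N hops of a trigger.
# Stage 2 stand-in: keep only metrics that deviate strongly from their history.
from collections import deque
from statistics import mean, pstdev
from typing import Dict, List


def within_hops(adjacency: Dict[str, List[str]], start: str,
                max_hops: int) -> Dict[str, int]:
    """Return each reachable node with its hop distance, up to max_hops."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand past the hop limit
        for nxt in adjacency.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def deviation_score(history: List[float], current: float) -> float:
    """Hypothetical score: distance from the mean in standard deviations."""
    mu, sigma = mean(history), pstdev(history)
    return abs(current - mu) / sigma if sigma > 0 else 0.0


graph = {
    "LOG_FILE_SYNC": ["log_write_latency"],
    "log_write_latency": ["io_latency"],
    "io_latency": ["REDO_ALLOCATION"],
    "REDO_ALLOCATION": ["ddl_activity"],
}
reach = within_hops(graph, "LOG_FILE_SYNC", max_hops=3)

samples = {  # metric -> (recent history, current value), all made up
    "log_write_latency": ([10, 12, 11, 9, 13], 25),
    "io_latency": ([5, 5, 5, 5], 5),
}
suspects = [m for m in reach
            if m in samples and deviation_score(*samples[m]) >= 3.0]
print(reach)
print(suspects)
```

With a 3-hop limit the traversal reaches `REDO_ALLOCATION` but stops before `ddl_activity`, and only the metric that actually moved survives the filter and would be passed on to the LLM.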

Generating Analysis Reports
#

Once the graph is ready, prompts need to be fed to the LLM to generate desired reports. A well-structured prompt can also improve report accuracy.

Anomalies have 5 components, which serve as the prompt for the LLM:

Anomaly: Anomaly description (“CPU usage spiked to 95% at 16:00 on 2023-10-05”)
Condition: Anomaly trigger condition (“exceeds 90% for >5 min”)
Metrics
Experience: Provides normal load values or recent maintenance tasks.
Output: Describes the report’s composition — anomaly verification (requiring further analysis), root cause analysis, recovery plan, summary, SQL text.
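As an illustration of how the five components could be assembled into a prompt, here is a small Python sketch. The class, field names, and layout are my own assumptions rather than the DBAIOps prompt template; the anomaly and condition strings are the examples quoted above, and the metric and experience lines are invented.

```python
# Assemble the five anomaly components into one prompt string.
from dataclasses import dataclass
from typing import List


@dataclass
class AnomalyPrompt:
    anomaly: str
    condition: str
    metrics: List[str]
    experience: List[str]
    output_sections: List[str]

    def render(self) -> str:
        lines = [
            f"Anomaly: {self.anomaly}",
            f"Condition: {self.condition}",
            "Metrics:",
            *[f"- {m}" for m in self.metrics],
            "Experience:",
            *[f"- {e}" for e in self.experience],
            "Output: write a report with these sections:",
            *[f"{i}. {s}" for i, s in enumerate(self.output_sections, 1)],
        ]
        return "\n".join(lines)


prompt = AnomalyPrompt(
    anomaly="CPU usage spiked to 95% at 16:00 on 2023-10-05",
    condition="exceeds 90% for >5 min",
    metrics=["cpu_usage_pct = 95"],                      # hypothetical
    experience=["index rebuild ran earlier that day"],   # hypothetical
    output_sections=[
        "Anomaly verification",
        "Root cause analysis",
        "Recovery plan",
        "Summary",
        "SQL text",
    ],
)
text = prompt.render()
print(text)
```

Keeping the output sections as an explicit list makes it hard to drop the SQL text requirement by accident when the template is edited later.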

Some personal thoughts:

Recent maintenance tasks are very useful — maintenance tasks generally have strong correlation, and failure analysis can’t just be simple technical analysis. However, who updates these maintenance tasks and which ones to update or not update is a problem.

The first few items in output are easy to understand, but the last one — SQL text — is a stroke of genius. In production environments, aside from hardware failures, database runtime status is strongly correlated with SQL. I personally believe you can unthinkingly capture SQL and discuss causality later. From an operations perspective, failures always require joint investigation with developers, so SQL text is basically mandatory to capture.

Evaluation
#

Comparison of analysis report quality across different tools and approaches:

Impressive results. Notably, DBAIOps specifically emphasizes that mid-sized LLMs already produce good analysis results. This is important — DeepSeek-R1 671B running bare isn’t bad, but the cost is on a completely different level.

Nitpicking
#

Can’t really be called “Ops” — it only has failure analysis functionality. Ops content is vast; failure analysis is just the tip of the iceberg.
Graph classification doesn’t match the graph example. The defined tag vertices and edges differ significantly from the example.

The vertices in the example play important roles, but these edge types aren’t defined: tag vertex-tool vertex, tag vertex-experience vertex, tag vertex-metric vertex. And the edges that should exist seem mostly absent, with only synonym edges present.

Undescribed parts of the example should be listed, otherwise it’s confusing.

The two-stage graph evolution results are a bit odd:

w/o ADF means without Stage 2 graph evolution (adaptive abnormal metric detection). w/o ADF should mean without Stage 1 graph evolution (graph inference and proximity discovery). w/o ADF means without either stage of graph evolution.

Here, the case with both stages of graph evolution is missing — having it would better demonstrate the effectiveness of two-stage graph evolution.

Root causes are somewhat limited:

The circled ones should be relatively common (I only looked at Oracle and Postgres), but these root causes are currently absent.

PG’s root causes are a bit sparse. Dirty page flushing generally isn’t a major issue — as a root cause, it probably ranks behind many other root causes.

Summary
#

Points I personally really like:

GraphRAG should be better than vector RAG for failure diagnosis.

(GraphRAG original paper: From Local to Global: A GraphRAG Approach to Query-Focused Summarization)

SS represents vector RAG, TS represents source text summaries, and C0/C1/C2/C3 represent GraphRAG at different knowledge granularities. From this chart, we can simply conclude: GraphRAG is better suited for multi-document complex scenarios and multi-angle analysis, but may not necessarily outperform vector RAG in precision.

Semi-automatic graph generation approach.

Graph generation is semi-automatic — trigger vertices are manually created, others can be auto-generated. For example, LOG FILE SYNC is a trigger vertex. Failure entry points can indeed be made into obvious anomaly points — these are the entry points. Same for PG, same for any failure — it aligns with human logic for understanding failures.

Automatic graph evolution.

Strengthening associations between certain vertices is meaningful, as evident from the “Performance of DBAIOps Variants” table.

Automatic baseline adjustment.

In Observability Engineering, there’s this passage about AIOps:

AI can only help when there are clearly discernible patterns and it can identify shifting baselines for prediction — such AIOps doesn’t exist yet.

DBAIOps in my eyes:

Clearly discernible patterns = DBAIOps’s graph, which includes failure models, anomaly relationships, monitoring data, and logs.

Shifting baselines = DBAIOps’s adaptive abnormal metric detection.

In summary, it’s a significant advancement over random chunking of failure knowledge, setting a single baseline, and vector approximate search in RAG models.

Original link: https://lastdba.com/2025/12/21/论文精读dbaio-ps/

CXL and PolarDB-CXL

Sun, 30 Nov 2025 00:00:00 +0000

Paper: Unlocking the Potential of CXL for Disaggregated Memory in Cloud-Native Databases

SIGMOD best paper: https://sigmod.org/sigmod-awards/sigmod-best-paper-award/

CXL and PolarDB-CXL
#

What is CXL
#

CXL: An open industry standard, a high-speed interconnect specification formulated by the CXL Consortium (founded in 2019 by tech giants Intel, AMD, ARM, etc.). It represents the evolutionary direction of computing architecture. Currently at CXL 4.0.

Feature	CXL 1.0/1.1	CXL 2.0	CXL 3.0/3.1	CXL 4.0 (latest)
Release	March/Sept 2019	November 2020	August 2022 / November 2023	November 2025
Base Protocol	PCIe 5.0 (32 GT/s)	PCIe 5.0 (32 GT/s)	PCIe 6.0 (64 GT/s)	PCIe 7.0 (128 GT/s)
Max Bandwidth (x16 link, both directions)	~1 Tb/s	~1 Tb/s	~2 Tb/s	~4 Tb/s
Topology Scale	Point-to-point / simple star	Single switch (≤32 nodes)	Multi-level Fabric (4096 nodes)	Ultra-large-scale Fabric

From my research, two descriptions of CXL left the deepest impression:

Memory as a Service
Near-memory computing and expansion

CXL switch: A switching chip, physical hardware. Many vendors are working on industrial implementations. The paper specifically references products from XConn Tech: CXL 2.0 switch. Note that as of November 22, 2025, XConn only has CXL 2.0 switches, no 3.0 products. However, there are products on the market supporting 3.0+ standards, such as Panmnesia CXL 3.2 Fabric Switch.

PolarCXLMem: According to the paper, “the first CXL-switch-based disaggregated memory system.” But the paper also states “we leverage the world’s first CXL switch[50]” — specifically referring to the XConn tech CXL 2.0 switch — and then says “PolarCXLMem is the first CXL-switch-based disaggregated memory.” This can be interpreted in two ways:

The first disaggregated memory system based on CXL switches
The first disaggregated memory system based on XConn tech CXL 2.0 switches

PolarDB-CXL: The paper doesn’t actually use this term, but the industry does. It stands for “integrate PolarCXLMem into the multi-primary version of PolarDB, known as PolarDB-MP”, which is essentially “the CXL-upgraded version of PolarDB-MP.” The paper repeatedly spells out the long phrase and never says PolarDB-CXL. For convenience, this article uses PolarDB-CXL as shorthand.

RDMA vs CXL
#

PolarDB-MP uses RDMA architecture, while PolarDB-CXL uses CXL architecture:

(https://medium.com/@anan.mirji/cxl-switch-vs-rdma-a-technical-comparison-for-high-performance-interconnects-6aaa031cde31)

RDMA architecture is a cross-host distributed interconnect architecture, while CXL architecture is a single-host expanded interconnect architecture.

Key differences:

Dimension	RDMA Architecture	CXL Architecture
Topology	Multi-host + network switch distributed arch	Single-host + CXL switch expanded arch
Communication	Network (InfiniBand/RoCE)	PCIe bus (CXL based on PCIe physical layer)
Core Components	RDMA NIC (dedicated NIC)	CXL Controller, CXL Switch
Resource Ownership	“Remote resources” across independent hosts	“Expanded resources” within the host architecture

CXL’s Advantages
#

CXL’s advantages over RDMA:

Low latency: CXL connects to host or device memory via PCIe; RDMA requires protocol interface conversion between InfiniBand and PCIe.

Instruction support: CXL provides native load/store instructions, allowing the CPU to directly manipulate remote CXL device memory as if it were local memory. RDMA requires reading from remote memory to local memory, processing locally, then writing back to remote memory.

Simplified applications: RDMA requires special interfaces and drivers, needing professionals to design complex programs; CXL provides transparent memory space, greatly simplifying application design.

Memory fusion: CXL 3.0 supports hardware-level memory sharing across hosts, whereas CXL 2.0 only offers memory pooling.

Problems with PolarDB-MP and the value CXL provides:

CXL’s critique of MP:

Memory pages are 4-16K, so even when only a small amount of data transfer is needed, data must move between local and shared memory, causing read/write amplification.
Maintaining local memory adds extra memory overhead, reducing throughput.
Recovery is very time-consuming.
RDMA is far better than TCP/IP, but under high concurrency, it suffers from “doorbell register implicit contention” and “cache thrashing” issues.
The database itself must maintain shared memory.

Benefits CXL brings:

Eliminates the “shared memory - local memory” hierarchical memory structure, and with it the maintenance overhead and read/write amplification. Because CXL load/store is close enough to local memory speed, all buffer pages can be stored directly in CXL memory.
Uses cache lines (64B) as the minimum transfer unit between CPU cache and main memory, rather than PolarDB-MP’s 4K pages.
Saves main memory. DRAM costs are very high, roughly 40-50% of server/rack costs.
Simplifies system design. Minimal modifications to existing systems are important for commercial database stability.
PolarRecv: An instant recovery system built on CXL. After a database crash, data and metadata remain on CXL, allowing direct reads of consistent state from CXL memory, so recovery is very fast. (This seems similar to how PG’s page cache helps fast startup after a crash.)
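The read/write amplification argument is easy to check with arithmetic. The Python sketch below assumes an aligned access and a hypothetical 128-byte row; the 4K/16K page sizes and the 64B cache line come from the text above.

```python
# Amplification = bytes actually moved / bytes the query needed.
def transfer_amplification(bytes_needed: int, unit_bytes: int) -> float:
    """Bytes moved per useful byte when data moves in fixed-size units."""
    units = -(-bytes_needed // unit_bytes)  # ceiling division
    return units * unit_bytes / bytes_needed


row_bytes = 128  # hypothetical row size, assumed aligned to the unit
page_4k = transfer_amplification(row_bytes, 4 * 1024)
page_16k = transfer_amplification(row_bytes, 16 * 1024)
cache_line = transfer_amplification(row_bytes, 64)

print(page_4k, page_16k, cache_line)  # 32.0 128.0 1.0
```

Under these assumptions, touching one small row moves 32x to 128x more data with page-granular transfers, while cache-line granularity moves only what is needed.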

DRAM vs RDMA vs CXL:

When data volume is small, RDMA has significantly higher latency than CXL; with larger data, RDMA’s latency improves slightly. Local DRAM access is slightly better than CXL access.

Overall, CXL memory access latency is slightly higher than DRAM but better than RDMA.

Regarding CXL’s higher latency vs DRAM, the paper explains: “database buffer pool operations are more sensitive to bandwidth than latency” — for database memory, bandwidth matters more than latency.

Custom Rack
#

Self-developed physical prototype rack. The left rack integrates two CXL switch-enabled clusters, each connected to memory devices and hosts; the right rack integrates one CXL switch connected to memory devices and hosts.

PolarCXLMem
#

The CXL 2.0 switch supports memory pooling, but the drivers don’t fully support it, so PolarCXLMem still designed its own CXL memory allocation and usage — it’s not fully transparent. PolarCXLMem processes CXL memory into a multi-tenant model, with different host nodes allocated different CXL memory regions.

PolarCXLMem characteristics:

Nodes have their own CXL memory regions; different nodes’ CXL memory does not overlap.
The buffer pool is allocated at database startup (by the CXL mem manager in the diagram) and does not change during runtime.
The memory unit structure in CXL mem is a block, which stores page data and page metadata, including: id (page id), lock state (whether the page is locked for update), prev/next (LRU doubly-linked list), lsn (latest log sequence number of the page).
Free list / in-use list is used for LRU.
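The block layout and the free/in-use lists can be modeled in a few dozen lines. This is a toy Python model built only from the fields listed above; the class names, the index-based links, and the eviction policy details are my assumptions, and the sketch ignores lock state when evicting.

```python
# Fixed pool of blocks sized once at startup, with a free list and an
# in-use LRU list threaded through the blocks' prev/next fields.
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Block:
    page_id: Optional[int] = None
    write_locked: bool = False
    prev: Optional[int] = None  # index of the neighbouring block in the LRU list
    next: Optional[int] = None
    lsn: int = 0
    data: bytes = b""


class Region:
    """One node's CXL memory region (hypothetical model)."""

    def __init__(self, n_blocks: int) -> None:
        self.blocks: List[Block] = [Block() for _ in range(n_blocks)]
        self.free: List[int] = list(range(n_blocks))
        self.head: Optional[int] = None  # most recently used
        self.tail: Optional[int] = None  # least recently used
        self.page_table: Dict[int, int] = {}

    def _unlink(self, i: int) -> None:
        b = self.blocks[i]
        if b.prev is not None:
            self.blocks[b.prev].next = b.next
        else:
            self.head = b.next
        if b.next is not None:
            self.blocks[b.next].prev = b.prev
        else:
            self.tail = b.prev
        b.prev = b.next = None

    def _push_front(self, i: int) -> None:
        b = self.blocks[i]
        b.prev, b.next = None, self.head
        if self.head is not None:
            self.blocks[self.head].prev = i
        self.head = i
        if self.tail is None:
            self.tail = i

    def touch(self, page_id: int) -> int:
        """Return the block index for page_id, evicting the LRU block if full."""
        i = self.page_table.get(page_id)
        if i is not None:
            self._unlink(i)  # hit: will be moved to the front
        else:
            if self.free:
                i = self.free.pop()
            else:
                i = self.tail  # miss with no free block: evict LRU
                self._unlink(i)
                del self.page_table[self.blocks[i].page_id]
            self.blocks[i] = Block(page_id=page_id)
            self.page_table[page_id] = i
        self._push_front(i)
        return i


region = Region(n_blocks=2)
for pid in (1, 2, 1, 3):
    region.touch(pid)
print(sorted(region.page_table))  # [1, 3]
```

Because the links live inside the blocks themselves, the LRU state sits in the same memory as the pages, which is why a crash in the middle of a list update matters for recovery.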

Question: PG’s page header has lsn, starting free space pointer, prune xid, etc. What does PolarDB-CXL’s page header structure look like?

PolarRecv
#

PolarDB-MP was designed based on RDMA, where data pages are written locally, and the disaggregated shared memory doesn’t contain the latest version of data pages. This means that after a host crash, recovery must either scan and apply all the redo log files (the paper says redo, not WAL) or fetch pages from shared memory, which holds only a small share of them.

CXL switches have independent power, so even if the host crashes, the latest data remains in CXL memory. PolarRecv leverages this to dramatically speed up database recovery after host crashes.

However, while CXL switch memory is transparent and persistent, directly using it after a crash still requires handling these issues:

LRU lists may be inconsistent at crash time
B-tree SMO (B-tree structure changes), such as index splits, may be inconsistent at crash time
Pages being updated at crash time may be inconsistent
The redo log buffer uses local DRAM. If the redo log hasn’t been flushed to disk at crash time, the page LSN in the CXL buffer pool may be greater than the last LSN in the redo log file, which violates the write-ahead logging rule that ARIES-style recovery depends on

PolarRecv’s design strategies:

Use mutex to protect the LRU structure. The mutex lock state indicates whether LRU was being modified at crash time. If so, LRU must be rebuilt; if not, use the LRU directly from CXL memory.
During B-tree SMO, a mini-transaction protects index pages. This mini-transaction is a two-phase lock corresponding to page locks. It’s only flushed to the redo log when the mini-transaction commits. So during recovery, if an index page is found with a write lock, recover from the redo logs.
PolarCXL’s read/write locks are stored in CXL memory. If a write lock still exists, the update was in an intermediate state at crash time and never completed. In that case, the page is rebuilt from the redo log file instead of being read from CXL memory in an inconsistent state.
During recovery, first obtain the maximum LSN from the redo log, then check the lock and LSN of pages in CXL memory. If a page’s LSN in CXL memory is greater than the max LSN, rebuild the page using redo log information rather than using the CXL memory version.
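The recovery rules above boil down to a per-page decision. Here is a hedged Python sketch of that decision only; the names are hypothetical, and the real PolarRecv also has to replay the redo log and rebuild structures, which is not modeled.

```python
# Decide, per page found in CXL memory after a crash, whether it can be
# reused as-is or must be rebuilt from the redo log.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CxlPage:
    page_id: int
    lsn: int
    write_locked: bool


def plan_recovery(pages: List[CxlPage], max_redo_lsn: int,
                  lru_mutex_held: bool) -> Dict[str, object]:
    reuse: List[int] = []
    rebuild: List[int] = []
    for p in pages:
        # A leftover write lock means the update was interrupted mid-flight.
        # An LSN beyond the durable redo log means the log never reached disk.
        if p.write_locked or p.lsn > max_redo_lsn:
            rebuild.append(p.page_id)
        else:
            reuse.append(p.page_id)
    return {
        "reuse_from_cxl": reuse,
        "rebuild_from_redo": rebuild,
        "rebuild_lru": lru_mutex_held,  # LRU was being modified at crash time
    }


crash_plan = plan_recovery(
    pages=[
        CxlPage(page_id=1, lsn=90, write_locked=False),
        CxlPage(page_id=2, lsn=95, write_locked=True),
        CxlPage(page_id=3, lsn=120, write_locked=False),
    ],
    max_redo_lsn=100,
    lru_mutex_held=False,
)
print(crash_plan)
```

In this example only page 1 is trusted directly; the speedup comes from the fact that, in a typical crash, most pages look like page 1.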

Memory Fusion
#

Because PolarCXLMem is built on a CXL 2.0 switch, and hardware memory fusion only arrives with CXL 3.0, memory fusion still has to be designed in software. Since each node’s buffer pool is placed in isolation in PolarCXLMem, memory fusion on CXL 2.0 is achieved through DBP metadata management: each buffer pool only stores the page’s CXL memory address, not the page itself.

To understand this diagram, you need to distinguish between CXL memory, DBP, and local buffer:

CXL memory is the physical hardware, CXL mem itself.
DBP is a region carved out of CXL for managing memory fusion services.
Local metadata buffer contains local buffer metadata and part of CXL.

Also understand that for each page in the buffer pool, there are two flags:

invalid: After another node writes to the page, the current node needs to invalidate its local CPU cache.
removal: When a page moves from the in-use list to the free list, all nodes must set the removal flag.

Memory fusion page access flow:

1. The requested page is not in the local page metadata buffer:
1.1 Allocate a new meta record from the free list, and provide invalid and removal addresses to the memory fusion service via RPC.

2. The requested page is in the local page metadata buffer:
2.1 First check the removal flag. If removal is set, it means the memory fusion service has already reclaimed the page, and a new memory address must be requested from the memory fusion service via RPC.
2.2 Then check the invalid flag. If invalid is set, it means the page has been modified by another node, and the CPU cache must be invalidated to ensure consistency.
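
A minimal sketch of this access flow, under stated assumptions: `PageMeta`, `FusionService`, and `access_page` are hypothetical names, the RPC is modeled as a plain method call, and clearing each flag once it has been handled is my own simplification rather than something the paper spells out.

```python
from dataclasses import dataclass


@dataclass
class PageMeta:
    """Hypothetical local metadata record for one page."""
    cxl_addr: int
    invalid: bool = False   # set after another node writes the page
    removal: bool = False   # set when the fusion service reclaims the page


class FusionService:
    """Stand-in for the memory fusion service that is reached via RPC."""

    def __init__(self):
        self._next_addr = 0x1000

    def allocate(self, page_id: int) -> int:
        # Hand out a fresh CXL address on every call (toy allocator).
        addr = self._next_addr
        self._next_addr += 0x1000
        return addr


def access_page(page_id, local_meta, service, actions):
    """Return the CXL address to read, recording the side effects in `actions`."""
    meta = local_meta.get(page_id)
    if meta is None:
        # Page not in the local metadata buffer: allocate a record and register it.
        meta = PageMeta(cxl_addr=service.allocate(page_id))
        local_meta[page_id] = meta
        actions.append("rpc_register")
        return meta.cxl_addr
    if meta.removal:
        # The service reclaimed the page: ask for a new address.
        meta.cxl_addr = service.allocate(page_id)
        meta.removal = False
        actions.append("rpc_reacquire")
    if meta.invalid:
        # Another node modified the page: drop the stale CPU cache lines.
        actions.append("invalidate_cpu_cache")
        meta.invalid = False
    return meta.cxl_addr


service = FusionService()
local_meta, actions = {}, []
first = access_page(7, local_meta, service, actions)
local_meta[7].invalid = True    # simulate a write by another node
second = access_page(7, local_meta, service, actions)
local_meta[7].removal = True    # simulate the service reclaiming the page
third = access_page(7, local_meta, service, actions)
print(actions)
```

The point to notice is that only the metadata and the flags are local; the page bytes themselves never leave CXL memory.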

Fusion consistency:

Since CXL 2.0 doesn’t have memory fusion, CPU caches aren’t automatically updated. PolarCXL implements multi-node concurrent write control through page-level locks.

Nodes must acquire read/write locks to read/write pages. When one node is writing to a page, other nodes cannot read or write that page. After a node finishes writing, it must also:

Flush the CPU cache to CXL mem (cache line flush) to ensure CXL mem has the latest page version.
Set the invalid flag to ensure other nodes don’t read stale page versions from their CPU caches.
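
The writer-side protocol can be illustrated with a toy model. Everything here is an assumption made for illustration: the CPU cache is a plain dict keyed by node, the page-level read/write lock is collapsed into a single `threading.Lock`, and `SharedPage`, `write_page`, and `read_page` are hypothetical names.

```python
import threading


class SharedPage:
    """Toy model of one page living in CXL memory."""

    def __init__(self, node_ids):
        self.lock = threading.Lock()  # stand-in for the page-level lock
        self.cxl_data = b"v0"
        self.invalid = {n: False for n in node_ids}  # one invalid flag per node


def write_page(page, writer, new_data, cpu_cache):
    with page.lock:
        cpu_cache[writer] = new_data       # the write first lands in the CPU cache
        page.cxl_data = cpu_cache[writer]  # "cache line flush" to CXL memory
        for node in page.invalid:
            if node != writer:
                page.invalid[node] = True  # tell other nodes their cache is stale


def read_page(page, reader, cpu_cache):
    with page.lock:
        if reader not in cpu_cache or page.invalid[reader]:
            # Stale or missing cache: reload from CXL memory.
            cpu_cache[reader] = page.cxl_data
            page.invalid[reader] = False
        return cpu_cache[reader]


page = SharedPage(["A", "B"])
cpu_cache = {}
before = read_page(page, "A", cpu_cache)
write_page(page, "B", b"v1", cpu_cache)
after = read_page(page, "A", cpu_cache)
print(before, after)
```

Without the invalid flag, node A would keep returning its cached `b"v0"` after B’s write, which is exactly the stale read the two extra steps are meant to prevent.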

Memory fusion summary:

CXL 2.0 itself supports incomplete memory fusion, meaning the database layer still needs to design a memory fusion scheme. Memory pages are accessed via CXL addresses, rather than local/remote access to entire pages as in the RDMA approach. The local CPU cache needs the database layer to flush it to ensure node data access consistency — this is a hard limitation. This also means cross-node updates still use exclusive page-level locks (the RDMA approach also uses exclusive page-level locks).

Performance Evaluation
#

Multi-Node Read/Write
#

Benchmarking with 12 instances on a 192 vCPU host, comparing RDMA (PolarDB-MP) vs CXL (PolarDB-MP with PolarCXLMem) performance:

Point queries:

Range queries:

Read-write:

Point queries: Read amplification is most severe for point queries. CXL’s bandwidth consumption is 3-4x lower than RDMA. When reaching 3 nodes, RDMA bandwidth is already saturated — adding more nodes doesn’t improve bandwidth.
Range queries: Read amplification is less severe. Only at >4 nodes does it reach the bandwidth ceiling of 11GB/s, while CXL can still scale linearly with nodes.
Read-write: Performance is similar to range queries, just with smaller differences.

PolarRecv Recovery Time
#

vanilla: Refers to the general approach, probably similar to PG reading from local cache or disk (possibly polar redo).
RDMA-based: Refers to PolarDB-MP where some data can be read from disaggregated shared storage.
PolarRecv: Refers to continuing to read most data from CXL, with only a small amount of partial pages needing recovery from redo files.

The paper discusses recovery time in 2 phases: startup/recovery and reaching pre-crash load levels. Read-only doesn’t need recovery — as long as there’s data, you can start and take load. When writes exist, recovery is needed, and the advantage of continuing to read from CXL memory becomes apparent. The difference between 1-minute, 2-minute, and 4-minute recovery times is significant — it could be the difference between business being nearly imperceptible and noticeably impacted.

Shared Data Updates
#

The real battleground for distributed database performance is updates to shared data. After PolarDB-MP crushed Taurus-MM, PolarDB-CXL in turn crushed PolarDB-MP:

At 0% shared data, the RDMA-based solution just accesses local buffers, and PolarDB-CXL just treats CXL as a memory pool. Even so, CXL-based still performs better, mainly due to the read/write amplification and bandwidth ceiling issues of the RDMA-based solution mentioned earlier.

The performance comparison chart above shows clearly that PolarDB-CXL significantly outperforms PolarDB-MP. However, note that when shared data exceeds 60%, PolarDB-CXL’s performance improvement becomes less significant, mainly because:

Page-level locks become the bottleneck.
As lock contention intensifies, processes enter sleep states, and frequent context switching further exacerbates resource contention.

Summary
#

PolarDB-CXL advantages:

Eliminates RDMA’s “local-remote” hierarchical memory structure design.
Resolves RDMA’s read/write amplification problem.
Provides a CXL-based memory pool.
PolarRecv, based on CXL persistent memory, enables faster database crash recovery.
Benchmarking shows PolarDB-MP CXL outperforms PolarDB-MP RDMA.

PolarDB-CXL disadvantages:

Cross-node updates still use page-level locks, which remain the main performance bottleneck in shared data update scenarios.
The CXL 2.0 switch seems a bit dated: by the time the paper was published, switch devices supporting CXL 3.2 were already available, and CXL 4.0 was announced in November 2025. We can expect future databases to be built on switch devices for the newer CXL standards.
The paper quality isn’t actually as high as the MP paper — it mainly revolves around solutions for the CXL 2.0 switch physical hardware, which differs from the extensive database-layer design found in the PolarDB-MP paper.

Original link: https://lastdba.com/2025/11/30/论文精读polar-db-cxl2025-sigmod最佳工业论文/

Paper Deep Read: PolarDB-MP | 2024 SIGMOD Best Industrial Paper

Sun, 30 Nov 2025 00:00:00 +0000

Paper: PolarDB-MP: A Multi-Primary Cloud-Native Database via Disaggregated Shared Memory

SIGMOD best paper: https://sigmod.org/sigmod-awards/sigmod-best-paper-award/

Foreword and Abstract
#

The paper opens with the problem: primary-replica architecture’s write throughput is limited by the primary. Shared-nothing architecture offers scalable multi-primary clusters that can solve the single-primary limitation, but this architecture suffers performance bottlenecks due to distributed transaction overhead. Recently, shared-storage-based cloud-native multi-primary databases have emerged, but under high-conflict scenarios, they face high conflict resolution costs and low data fusion efficiency.

So the problem is: single-primary primary-replica, shared-nothing, and shared-storage cloud-native multi-primary architectures all have their own issues.

This paper proposes PolarDB-MP, a novel multi-primary cloud-native database combining disaggregated shared memory with shared storage. (Since multi-primary cloud-native databases already exist, it needs to be “novel.”)

PolarDB-MP’s basic characteristics:

All nodes can equally access all data, allowing transactions to be processed independently on a single node, without traditional distributed transaction mechanisms.
Shared storage: PolarStore and PolarFS, or other compatible shared storage solutions.
Built on disaggregated shared memory.
Low-latency communication via RDMA (Remote Direct Memory Access).
LLSN (Local Logical Sequence Number): Used to establish partial order for WAL logs generated by different nodes, combined with custom recovery strategies to ensure consistency and efficiency during abnormal recovery.
Core component PMFS (Polar Multi-Primary Fusion Server) responsible for:
- Transaction Fusion — transaction ordering and visibility management
- Buffer Fusion — distributed shared buffer mechanism
- Lock Fusion — cross-node concurrency control

Classification
#

The classification is mainly to understand PolarDB-MP’s historical position and the “first” qualifier:

PolarDB-MP is the first multi-primary cloud-native database that utilizes disaggregated shared memory and shared storage for transaction coordination and buffer fusion

Competitor Weaknesses
#

Shared-nothing products: The paper doesn’t call out individual products, just one line: transactions accessing across multiple partitions require significant additional overhead for distributed transactions.

Oracle:

Expensive distributed lock management
Expensive network overhead
Reliance on sophisticated hardware (alien tech)
Difficult to migrate to cloud, or higher TCO (including maintenance and labor costs) compared to cloud-native databases after migration

AWS Aurora-MM:

Uses optimistic transaction model; high transaction abort rates under conflicts
In some scenarios, 4-node throughput is lower than single-node

Huawei Taurus-MM:

Pessimistic transaction model. Relies on page storage and log replay to ensure cache consistency, with high overhead in concurrency control and data synchronization.
Under 50% shared data read-write workload, 8 nodes only achieve 1.5x single-node performance improvement

The Oracle critique here is mainly plausible-sounding trash talk, while Aurora-MM and Taurus-MM have original vendor citations:

Aurora-MM “in some scenarios, 4-node throughput is lower than single-node”
Taurus-MM “under 50% shared data read-write workload, 8 nodes only achieve 1.5x single-node performance improvement”

Transaction Fusion
#

Transaction Fusion Overview
#

How does multi-primary ensure consistent data views?

Snapshot isolation is a common MVCC implementation. A characteristic of snapshot isolation is that queries or transactions must maintain their consistent data view during execution. But in multi-primary architecture, local nodes cannot guarantee consistent data views due to remote data updates.

To solve this, general multi-primary shared-storage architectures introduce global transaction mechanisms (Aurora-MM or Taurus-MM). PolarDB-MP introduces an innovative technique — transaction fusion within PMFS. Each node only maintains local transaction information, which can be accessed by other nodes via RDMA. In contrast to global transactions, transaction fusion is decentralized.

Local Transactions and TIT Table
#

Each node in PolarDB-MP maintains a small amount of memory to store local transaction information (accessible by other nodes via RDMA). This local transaction information is stored in the Transaction Information Table (TIT).

TIT table contents:

Transaction object pointer
Commit timestamp (CTS) assigned by the global timestamp coordinator (TSO)
version, representing different transactions in the same slot
ref, indicating whether this transaction is being waited on by other transactions for lock release (probably PLock or RLock)

How Transactions Proceed
#

When a transaction begins, a local transaction id (presumably txid) is assigned, and the TIT slot stores the transaction object pointer, ref initialized to 0, and CTS initialized to CSN_INIT.

PolarDB-MP uses a global transaction ID to identify a transaction: global transaction ID = (node_id, trx_id, slot_id, version). The global transaction ID does not include CTS. To know the commit order of transactions, such as when constructing a transaction visibility view, you need to go through the global transaction ID, via RDMA, to the target node to find CTS (similar to PG’s pg_xact_commit_timestamp() function, which finds the corresponding transaction commit time from local files using the transaction id).

If trx_id is the transaction ID in PG, then node_id + trx_id can identify the global uniqueness of a transaction, or node_id + slot_id + version could also work to some extent (when slot id is not reused, e.g., at a given moment it uniquely identifies a transaction). Of course, the extra information combined is also unique. After all, this information is key to PolarDB-MP’s transaction fusion implementation.
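
To make the pieces concrete, here is a small sketch of a TIT slot and the four-part global transaction ID. It is a toy model under assumptions: `Node`, `TitSlot`, `begin`, and `commit` are hypothetical names, `CSN_INIT = 0` is a placeholder value, and bumping `version` each time a slot is reused is my reading of “version, representing different transactions in the same slot”.

```python
from dataclasses import dataclass
from typing import NamedTuple, Optional

CSN_INIT = 0  # placeholder for "not committed yet"


class GlobalTrxId(NamedTuple):
    node_id: int
    trx_id: int
    slot_id: int
    version: int


@dataclass
class TitSlot:
    trx: Optional[object] = None  # transaction object pointer
    cts: int = CSN_INIT           # commit timestamp
    version: int = 0              # distinguishes transactions reusing the slot
    ref: int = 0                  # number of transactions waiting on this one


class Node:
    def __init__(self, node_id: int, slots: int = 4):
        self.node_id = node_id
        self.tit = [TitSlot() for _ in range(slots)]
        self._next_trx_id = 1

    def begin(self, slot_id: int) -> GlobalTrxId:
        slot = self.tit[slot_id]
        slot.version += 1          # assumption: a new occupant bumps the version
        slot.trx, slot.ref, slot.cts = object(), 0, CSN_INIT
        trx_id = self._next_trx_id
        self._next_trx_id += 1
        return GlobalTrxId(self.node_id, trx_id, slot_id, slot.version)

    def commit(self, gid: GlobalTrxId, cts: int) -> None:
        self.tit[gid.slot_id].cts = cts


node = Node(node_id=1)
g1 = node.begin(slot_id=0)
node.commit(g1, cts=50)
g2 = node.begin(slot_id=0)  # the same slot is reused by a later transaction
print(g1, g2)
```

Note that the CTS lives in the slot, not in the ID, which is why a reader has to follow the ID back to the owning node to learn the commit order.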

Each transaction constructs a visibility view using the global transaction ID and CTS. The visibility view concept is consistent with PG: the current read view can read data rows committed before the read view, and the latest version rows.

Accessing Remote CTS
#

Since CTS is local (in TIT or on the local filesystem), obtaining the reading transaction’s CTS is an interesting task:

1.1 If a row’s CTS is CSN_INIT/CTS_INIT, meaning the transaction is still active, return the maximum CTS to indicate it’s invisible to all transactions except itself.

1.2 If a row’s CTS is not CSN_INIT/CTS_INIT, meaning the transaction has committed, and it’s in the local TIT, directly return CTS.

2. If a row has no CTS, obtain CTS via the row’s g_trx_id.

2.1 If the transaction belongs to the local node (g_trx_id has node id), read from local filesystem to local TIT.

2.2 If the transaction doesn’t belong to the local node, read from remote filesystem to remote TIT via RDMA.

3.1 If slot.version != g_trx_id.version, the transaction must have committed, so the row is definitely visible to all transactions. Return minimum CTS to indicate visibility to all transactions.

3.2 If slot.version = g_trx_id.version, refer to 1.1, 1.2.
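
The lookup rules above can be sketched as one function. This is a simplified model, not the paper’s code: `row_cts`, `Row`, and `TitSlot` are hypothetical names, the sentinel values are placeholders, and a nested dict stands in for both the local TIT read and the remote RDMA read.

```python
from dataclasses import dataclass
from typing import NamedTuple, Optional

CTS_INIT = -1         # placeholder: transaction not committed yet
CTS_MIN = 0           # placeholder: visible to every transaction
CTS_MAX = 2**63 - 1   # placeholder: invisible to every other transaction


class GlobalTrxId(NamedTuple):
    node_id: int
    trx_id: int
    slot_id: int
    version: int


@dataclass
class TitSlot:
    cts: int
    version: int


@dataclass
class Row:
    g_trx_id: GlobalTrxId
    cts: Optional[int] = None  # None models "the row has no CTS"


def row_cts(row: Row, tit_by_node: dict) -> int:
    """Resolve the CTS used for visibility checks on `row`."""
    if row.cts is not None:
        # The row carries a CTS: active means max CTS, committed means that CTS.
        return CTS_MAX if row.cts == CTS_INIT else row.cts
    gid = row.g_trx_id
    # Local TIT read or remote RDMA read; both are a dict lookup in this toy model.
    slot = tit_by_node[gid.node_id][gid.slot_id]
    if slot.version != gid.version:
        # The slot was reused, so the original transaction finished long ago.
        return CTS_MIN
    return CTS_MAX if slot.cts == CTS_INIT else slot.cts


tit = {
    1: {0: TitSlot(cts=40, version=2)},
    2: {3: TitSlot(cts=CTS_INIT, version=5)},
}
committed = row_cts(Row(GlobalTrxId(1, 10, 0, 2)), tit)
reused = row_cts(Row(GlobalTrxId(1, 9, 0, 1)), tit)
active = row_cts(Row(GlobalTrxId(2, 7, 3, 5)), tit)
print(committed, reused, active)
```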

PolarDB-MP’s transaction visibility concept is very similar to PG’s, except PG uses txid instead of CTS to indicate transaction ordering and doesn’t need to consider remote access.

Row Update Transactions
#

Additionally, row updates are also very similar:

When PolarDB-MP updates a row, besides updating the data itself, it must also:

Update the row’s global transaction ID (g_trx_id) (if it’s an in-row update, then it modifies PG’s row header).
Update the row’s CTS. (The paper doesn’t specify whether this is in the row header or filesystem. If similar to PG, it should be in the commit_ts directory on the filesystem. Polar not confirmed.)

Questions About Transaction Fusion (Things I Didn’t Understand)
#

g_trx_id is row metadata written to disk. If nodes are added or removed, does the node_id in the data row’s g_trx_id need updating? If not, which node should the row be loaded into when read next time?

A new row’s CTS is stored on local node A. If another node B updates this row, is the new CTS on node A or B?

“assigned a read view, which consists of its own g_trx_id and the current CTS.” Do read-only transactions also get assigned a g_trx_id when constructing a read view?

Without a doubt, a parameter like track_commit_timestamp must be forcibly enabled.

If there are many writes on node A and reads on node B, B’s reads will access A’s TIT data via RDMA — does this generate significant network IO? Should this be considered when designing read-write separation or multi-node reads and writes? The original paper might answer this — “Multi-primary architectures inherently require synchronizing large amounts of data and messages between nodes to support concurrent access across multiple nodes. As network technology develops (InfiniBand, RDMA) and achieves commercial deployment, the network bottleneck becomes less significant.”

Global timestamps could become a bottleneck in distributed systems. PolarDB-SCC is a shared-storage-based timestamp solution that appears to perform well. Due to time constraints, I’ll set this aside for now.

Buffer Fusion
#

Buffer Fusion Introduction
#

Each node in PolarDB-MP can update any data page, leading to substantial data transfer. Buffer Fusion’s distributed buffer pool (DBP) is designed to solve this problem. Each node has a local buffer pool (LBP), which is a subset of DBP.

How Buffer Fusion Works
#

LBP has two new metadata items for pages:

valid: whether the page has been updated by another node
r_addr: pointer to the page in DBP

When accessing a page from LBP, the current node must first check whether the page is valid. If it is invalid, the node must fetch the page from DBP via r_addr. After DBP stores a new version of a page, buffer fusion invalidates the copies of that page held by other nodes. Dirty pages in LBP are flushed to DBP periodically in the background, or when the PLock is released.

Page access steps:

1.1 If the page is in LBP and valid, access directly.
1.2 If the page is in LBP and invalid, access DBP via RDMA.
2. If the page is in neither LBP nor DBP, read from shared storage.
3. The page is loaded from a node into LBP and registered in DBP.
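
A rough sketch of these page access steps, with the RDMA read reduced to a dict lookup. `LbpEntry` and `read_page` are hypothetical names, `r_addr` is simply the page id in this toy model, and the branch for a page that is missing from LBP but present in DBP is my own assumption, since the steps above do not list that case explicitly.

```python
from dataclasses import dataclass


@dataclass
class LbpEntry:
    data: bytes
    valid: bool   # False once another node has updated the page
    r_addr: int   # where the page lives in DBP


def read_page(page_id, lbp, dbp, storage, trace):
    entry = lbp.get(page_id)
    if entry is not None and entry.valid:
        trace.append("lbp")              # local copy is valid: access directly
        return entry.data
    if entry is not None:
        entry.data = dbp[entry.r_addr]   # local copy is stale: refetch from DBP
        entry.valid = True
        trace.append("dbp")
        return entry.data
    if page_id in dbp:
        data = dbp[page_id]              # assumption: not in LBP but already in DBP
        trace.append("dbp")
    else:
        data = storage[page_id]          # in neither: read from shared storage
        dbp[page_id] = data              # and register the page in DBP
        trace.append("storage")
    lbp[page_id] = LbpEntry(data, valid=True, r_addr=page_id)
    return data


storage, dbp, lbp, trace = {1: b"v1"}, {}, {}, []
read_page(1, lbp, dbp, storage, trace)   # cold read
read_page(1, lbp, dbp, storage, trace)   # served locally
dbp[1] = b"v2"                           # another node pushes a new version
lbp[1].valid = False                     # buffer fusion invalidates the local copy
latest = read_page(1, lbp, dbp, storage, trace)
print(trace, latest)
```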

The key component of PolarDB’s buffer fusion is disaggregated shared memory. It appears to be one or a group of physical hardware devices, or an integrated component built on top of them, separate from the compute nodes. This differs significantly from memory in traditional distributed systems.

It’s also different from transaction fusion: transaction fusion requires accessing remote nodes with the same architecture, while buffer fusion doesn’t. Instead, it accesses the separate disaggregated shared memory component.

Questions About Buffer Fusion (Things I Didn’t Understand)
#

Disaggregated shared memory seems like a component separate from standard hosts — so what exactly is it?

Lock Fusion
#

Lock Types in Lock Fusion
#

Buffer fusion solves how nodes access remote data; lock fusion solves concurrent access control.

Lock fusion has two types of locks:

page-locking (PLock): Similar to latches, controlling atomic access and internal structure consistency. Single-node page access doesn’t use PLock.
row-locking (RLock): Responsible for cross-node transaction control, following the two-phase locking protocol.

PLock Access Flow
#

(The paper doesn’t say where lock fusion occurs. Since PLock is a page-level latch and page fusion happens on shared memory, I’ll assume lock fusion also occurs on shared memory, as this is easier to understand.)

1. Before updating/reading a page, the local lock manager checks whether the local node already holds the corresponding X/S PLock (or higher-level lock).
1.1 If yes, execute in place.
1.2 If no, acquire PLock through Lock Fusion.
2. Lock fusion checks for conflicts before responding; if a conflict exists, the request waits.
3. When PLock is released by a node, it notifies Lock Fusion, which updates PLock’s state and notifies other nodes to continue their operations.

PLock Lazy Releasing
#

According to the PLock access flow above, a PLock is released immediately after the local operation completes. This may not be optimal, given temporal locality: “a data item or instruction accessed at a given time is likely to be accessed again in the near future.” Lazy releasing reduces the RPC load of acquiring PLocks.

The principle is simple: PLock is not immediately released after use on the local node; it’s only released when ref reaches 0.

When another node needs the PLock while the local node is holding it, Lock Fusion sends a negotiation message to the holder; the local node must coordinate with Lock Fusion rather than handling the PLock autonomously. Lock Fusion resolves cross-node lock ownership with a “first-in-first-out” strategy, and other nodes can acquire the lock only once the local node’s ref reaches 0.

Lazy releasing is an effective distributed lock solution, balancing local lock optimization with global lock allocation.
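
A single-threaded toy model of lazy releasing, to show where the RPC savings come from. It is a sketch under assumptions, not the paper’s design: `LockFusion`, `Node`, and their methods are hypothetical, only exclusive locks are modeled, and a queued node simply retries `lock()` after it has been granted the PLock.

```python
from collections import deque


class LockFusion:
    def __init__(self):
        self.owner = {}     # page_id -> node that holds the PLock
        self.waiters = {}   # page_id -> FIFO queue of waiting nodes
        self.rpc_count = 0

    def request(self, node, page_id):
        self.rpc_count += 1
        holder = self.owner.get(page_id)
        if holder is None:
            self._grant(node, page_id)
            return
        self.waiters.setdefault(page_id, deque()).append(node)
        holder.on_negotiation(page_id)   # ask the holder to give the lock up

    def release(self, node, page_id):
        del self.owner[page_id]
        queue = self.waiters.get(page_id)
        if queue:
            self._grant(queue.popleft(), page_id)   # first in, first out

    def _grant(self, node, page_id):
        self.owner[page_id] = node
        node.held.add(page_id)
        if self.waiters.get(page_id):
            node.wanted.add(page_id)     # more nodes are queued behind the new owner


class Node:
    def __init__(self, fusion):
        self.fusion = fusion
        self.held = set()    # PLocks kept locally, possibly idle
        self.ref = {}        # page_id -> number of local users
        self.wanted = set()  # pages that other nodes are waiting for

    def lock(self, page_id):
        if page_id not in self.held:
            self.fusion.request(self, page_id)
        if page_id in self.held:
            self.ref[page_id] = self.ref.get(page_id, 0) + 1
            return True
        return False         # queued behind the current holder

    def unlock(self, page_id):
        self.ref[page_id] -= 1
        self._release_if_idle(page_id)

    def on_negotiation(self, page_id):
        self.wanted.add(page_id)
        self._release_if_idle(page_id)

    def _release_if_idle(self, page_id):
        # Lazy release: give the PLock back only if someone wants it and ref is 0.
        if page_id in self.wanted and self.ref.get(page_id, 0) == 0:
            self.wanted.discard(page_id)
            self.held.discard(page_id)
            self.fusion.release(self, page_id)


fusion = LockFusion()
a, b = Node(fusion), Node(fusion)
a.lock(1); a.unlock(1)           # A keeps the PLock after use
a.lock(1)                        # no RPC this time
rpcs_after_reuse = fusion.rpc_count
b_got_it = b.lock(1)             # B must wait while A still uses the page
a.unlock(1)                      # ref reaches 0, so the lock moves to B
b_retry = b.lock(1)
print(rpcs_after_reuse, b_got_it, b_retry)
```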

RLock Overview
#

RLock uses the global transaction ID for determination (similar to PG). According to the transaction fusion content, the global transaction ID contains node id, transaction id, slot id, version. So when a local node reads a row, it can directly obtain the lock information on the row, know where the lock is (node id), and know if the lock is active.

There are two interesting points about determining transaction activity:

From the transaction fusion flow of accessing remote CTS: if the transaction’s CTS is a valid value, or the transaction is in the same slot in TIT but not the same version, the transaction has definitely committed, so no need to check activity. If the source transaction is not active, there’s no need to wait for locks — proceed directly.
PG has the concept of a minimum active transaction ID, which also exists in PolarDB-MP. If the transaction ID on the row is less than the global minimum active transaction ID, the source transaction must have also committed (or rolled back).

How RLock Works
#

Local rows are handled locally; only conflicts are processed in Lock Fusion; cross-node row locks require RLock. “The transaction ID in the row functions as a lock indicator. So this protocol only supports exclusive (X) lock. The shared (S) lock on a row is not supported in PolarDB-MP, but it’s acceptable.” Only truly conflicting exclusive locks need RLock; shared locks don’t need RLock.

T30 reads the row from shared storage and can determine from the row’s metadata (g_trx_id) that the transaction is active and which node it’s on.
T30 remotely adjusts T10’s transaction ref.
T30 sends a wait status to the Lock Fusion service.
Lock Fusion adds wait information to the wait info table.
T10 finishes execution and notifies Lock Fusion.
Lock Fusion checks the wait info table, then notifies T30 it can continue.
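
The T10/T30 sequence can be walked through with a toy model. All names are hypothetical (`LockFusion`, `try_lock_row`, `finish`), the TIT is a plain dict, and notifying Lock Fusion only when `ref > 0` is my inference from the description of the `ref` field rather than something the paper states in these words.

```python
from collections import defaultdict


class LockFusion:
    """Toy wait info table: holder transaction -> list of waiting transactions."""

    def __init__(self):
        self.wait_info = defaultdict(list)

    def add_wait(self, holder, waiter):
        self.wait_info[holder].append(waiter)

    def on_finish(self, holder):
        # Return the transactions that can now continue.
        return self.wait_info.pop(holder, [])


def try_lock_row(row, waiter, tit, fusion):
    holder = row["g_trx_id"]
    if holder is None or not tit[holder]["active"]:
        row["g_trx_id"] = waiter      # the row's transaction ID acts as the X lock
        return True
    tit[holder]["ref"] += 1           # remotely bump the holder's ref
    fusion.add_wait(holder, waiter)   # register the wait with Lock Fusion
    return False


def finish(trx, tit, fusion):
    tit[trx]["active"] = False
    if tit[trx]["ref"] > 0:           # someone is waiting on this transaction
        return fusion.on_finish(trx)
    return []


tit = {"T10": {"active": True, "ref": 0}, "T30": {"active": True, "ref": 0}}
row = {"g_trx_id": "T10"}
fusion = LockFusion()
first_try = try_lock_row(row, "T30", tit, fusion)   # T10 still holds the row
woken = finish("T10", tit, fusion)                  # T10 ends, T30 is notified
second_try = try_lock_row(row, "T30", tit, fusion)
print(first_try, woken, second_try)
```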

Questions About Lock Fusion (Things I Didn’t Understand)
#

“when attempting to update a row, it must already hold an X PLock lock on the page containing the row”

Updating also requires holding an exclusive PLock on the page, meaning updates on the same page block each other — doesn’t this affect concurrency? Locally, there shouldn’t be such behavior; PG doesn’t have page-exclusive locks for update scenarios.

In the “Logs ordering and recovery” chapter, there are two statements: “Thanks to the PLock design, only one transaction can update a page at a time” and “When a page is updated across two nodes, one node pushes its updated page to the DBP before releasing the PLock, allowing the next node to retrieve it from the DBP.”

Yes, during cross-node data updates, there are page-level exclusive locks.

PMFS Summary (Hot Take)
#

PMFS (Polar Multi-Primary Fusion Server) is the core component implementing PolarDB-MP’s multi-primary distributed system. Among its features, the global transaction ID design is ingenious — it transforms PG’s transaction ID into one containing node information, transaction id, and transaction fusion’s slot and version information, placed in the row header. This has several benefits:

Directly accessing a row reveals the row’s version ordering.
Directly accessing a row reveals which node updated it.
Directly accessing a row reveals whether cross-node locks may exist.
Uses minimum active transactions to reduce conflict determination.
Uses global transaction ID information to achieve distributed retrieval of transaction commit timestamps (CTS).

Additionally:

Buffer fusion and lock fusion in PMFS appear highly dependent on the shared memory component.
RDMA is omnipresent throughout.

Log Ordering
#

Partial Order
#

First, WAL is generated on each node without any concurrency control mechanism — each writes independently to shared storage. Each node’s LSN is sequential for that node, but across multiple nodes, WAL records don’t exhibit global ordering.

But is global ordering needed when writing WAL records?

From the paper, most of the time it’s not needed.

Only one case requires guaranteed global ordering during writing: cross-node updates to the same page.

However, according to the PMFS lock fusion mechanism, cross-node updates to the same page are exclusive. Lock fusion can ensure the ordering of cross-node page updates.

Recovery Ordering
#

Since LLSNs from cross-node writes come from multiple nodes and are likely not in order, recovery needs to be done in order. Reading all WAL records and sorting by LLSN is a simple approach, but massive sorting is very resource-intensive.

PolarDB-MP proposes sorting LLSNs segment by segment. Each segment is called a chunk, and the chunk boundaries are called LLSN bounds. PolarDB-MP guarantees that each LLSN bound is less than the next one, so it only needs to sort the LLSNs within each chunk.
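
Why per-chunk sorting is enough can be seen in a few lines. This sketch takes the LLSN bounds as a given input, since how PolarDB-MP picks them is not covered here; `replay_order` and the record layout `(llsn, node_id)` are hypothetical.

```python
from bisect import bisect_left


def replay_order(records, bounds):
    """Yield (llsn, node_id) records in LLSN order using chunk-local sorts.

    `bounds` must be strictly increasing; chunk i holds the records whose
    LLSN falls in (bounds[i-1], bounds[i]].
    """
    chunks = [[] for _ in range(len(bounds) + 1)]
    for rec in records:
        chunks[bisect_left(bounds, rec[0])].append(rec)
    for chunk in chunks:
        # Every LLSN in this chunk is greater than all LLSNs in earlier chunks,
        # so sorting inside the chunk is sufficient for a global order.
        yield from sorted(chunk)


# WAL records from two nodes, interleaved out of order.
records = [(5, "A"), (2, "B"), (9, "A"), (7, "B"), (12, "A"), (11, "B")]
ordered = list(replay_order(records, bounds=[5, 10]))
print(ordered)
```

Each sort touches only one chunk’s worth of records, so the working set is bounded by the chunk size instead of the whole log.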

Questions About Log Ordering (Things I Didn’t Understand)
#

“utilizing redo (write-ahead) logs for data recovery and undo logs for rolling back uncommitted changes”

PolarDB-MP has undo log files? What is this undo for?

I didn’t see anything particularly special about LLSN; the paper doesn’t detail its structure. LSN seems sufficient — maybe there are differences regarding global transaction IDs.

Evaluation
#

Read-only operations are all local, so adding nodes linearly increases throughput. If read-write/write-only data is well-partitioned and doesn’t cross nodes, it’s also nearly linear.

The problem lies in shared data across read-write/write-only nodes, which is the ultimate test of distributed database performance.

The paper directly compares against Huawei’s Taurus-MM. The conclusion: PolarDB-MP’s cross-node write performance is indeed significantly better.

Nitpicking
#

The paper mentions Taurus-MM’s performance improvement under 8-node shared data in two places, but the data is inconsistent:

The eight-node cluster only improves the throughput by 1.8× compared to the single-node version in the read-write workload with 50% shared data.

the throughput of Taurus-MM’s eight-node cluster is approximately 1.8× that of a single node under the SysBench write-only workload with 30% shared data, illustrating the trade-offs and challenges in optimizing multi-primary cloud databases

Sometimes 30% shared data, sometimes 50% — not very rigorous. The original Taurus MM paper says 50%:

Summary
#

Not much to summarize — see the Foreword and Abstract and PMFS Summary sections.

Original link: https://lastdba.com/2025/11/30/论文精读polar-db-mp2024-sigmod最佳工业论文/
