<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>论文解读 on Last DBA</title><link>https://lastdba.com/en/categories/%E8%AE%BA%E6%96%87%E8%A7%A3%E8%AF%BB/</link><description>Recent content in 论文解读 on Last DBA</description><generator>Hugo -- gohugo.io</generator><language>en-US</language><copyright>© 2026 liuzhilong62</copyright><lastBuildDate>Sat, 03 Jan 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://lastdba.com/en/categories/%E8%AE%BA%E6%96%87%E8%A7%A3%E8%AF%BB/index.xml" rel="self" type="application/rss+xml"/><item><title>Paper Deep Read: Anarchy in the Database</title><link>https://lastdba.com/en/2026/01/03/paper-deep-read-anarchy-in-the-database/</link><pubDate>Sat, 03 Jan 2026 00:00:00 +0000</pubDate><guid>https://lastdba.com/en/2026/01/03/paper-deep-read-anarchy-in-the-database/</guid><description>&lt;p&gt;Paper: Anarchy in the Database: A Survey and Evaluation of Database Management System Extensibility&lt;/p&gt;
&lt;p&gt;GitHub: &lt;a href="https://github.com/cmu-db/ext-analyzer" target="_blank" rel="noreferrer"&gt;https://github.com/cmu-db/ext-analyzer&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;PGConf: The trouble with extensions (PGConf.dev 2025)&lt;/p&gt;

&lt;h2 class="relative group"&gt;Why This Paper
 &lt;div id="why-this-paper" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#why-this-paper" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;
&lt;p&gt;This is a survey of database extensions (mainly Postgres), covering the implementation approaches of extensions across different databases, existing problems, and most importantly, compatibility. The most significant finding: an evaluation of over 400 PostgreSQL extensions shows that 16.8% of extensions have compatibility issues with at least one other extension, potentially leading to system failures.&lt;/p&gt;</description><content:encoded>&lt;p&gt;Paper: Anarchy in the Database: A Survey and Evaluation of Database Management System Extensibility&lt;/p&gt;
&lt;p&gt;GitHub: &lt;a href="https://github.com/cmu-db/ext-analyzer" target="_blank" rel="noreferrer"&gt;https://github.com/cmu-db/ext-analyzer&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;PGConf: The trouble with extensions (PGConf.dev 2025)&lt;/p&gt;

&lt;h2 class="relative group"&gt;Why This Paper
 &lt;div id="why-this-paper" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#why-this-paper" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;
&lt;p&gt;This is a survey of database extensions (mainly Postgres), covering the implementation approaches of extensions across different databases, existing problems, and most importantly, compatibility. The most significant finding: an evaluation of over 400 PostgreSQL extensions shows that 16.8% of extensions have compatibility issues with at least one other extension, potentially leading to system failures.&lt;/p&gt;
&lt;p&gt;Analysis tools and results are on GitHub; Marco Slot&amp;rsquo;s presentation is at PGConf.&lt;/p&gt;

&lt;h2 class="relative group"&gt;Extension Categories
 &lt;div id="extension-categories" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#extension-categories" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;

&lt;h3 class="relative group"&gt;Extension Classification
 &lt;div id="extension-classification" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#extension-classification" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h3&gt;
&lt;p&gt;The paper&amp;rsquo;s chapter on extension classification is lengthy, but a single diagram summarizes it.&lt;/p&gt;
&lt;p&gt;Extensions across 6 databases:&lt;/p&gt;
&lt;p&gt;


&lt;img src="https://lastdba.com/img/csdn/c2025f80a5c9.png" alt="image-20251228140624785" /&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;PostgreSQL (1986): Written in C and designed from the beginning to be extensible. As a result, PostgreSQL has the richest and most diverse extension ecosystem.&lt;/li&gt;
&lt;li&gt;MySQL (1994): Written in C++, best known for its storage engine plugin architecture.&lt;/li&gt;
&lt;li&gt;MariaDB (2009): A fork of MySQL, also C++ based, supporting more extensions than the original MySQL.&lt;/li&gt;
&lt;li&gt;SQLite (2000): Embedded database written in C, adaptable to various hardware devices and operating systems.&lt;/li&gt;
&lt;li&gt;Redis (2009): In-memory key-value store written in C. Its extensibility is unusual: extensions can only run on top of the DBMS&amp;rsquo;s key-value storage layer.&lt;/li&gt;
&lt;li&gt;DuckDB (2018): Embedded analytical database written in C++, with a rapidly growing extension ecosystem.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 class="relative group"&gt;Flexibility and Security
 &lt;div id="flexibility-and-security" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#flexibility-and-security" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h3&gt;
&lt;p&gt;Extension security and flexibility are a trade-off — PG extensions are the most flexible but least secure; Redis is the most secure but least flexible:&lt;/p&gt;
&lt;p&gt;


&lt;img src="https://lastdba.com/img/csdn/a4b3110396a3.png" alt="image-20260103140026801" /&gt;&lt;/p&gt;

&lt;h3 class="relative group"&gt;How PostgreSQL Extensions Are Typically Implemented
 &lt;div id="how-postgresql-extensions-are-typically-implemented" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#how-postgresql-extensions-are-typically-implemented" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h3&gt;
&lt;p&gt;PG generally has two ways to implement extensions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Through handler functions, such as UDFs, UDTs, foreign tables, storage engines (table access methods), and index access methods.&lt;/li&gt;
&lt;li&gt;Through hooks. Hooks are declared as function pointers in global variables; if a hook is set, PostgreSQL calls the function it points to instead of its own default code.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Implementations may use both approaches — they&amp;rsquo;re not mutually exclusive. The other 5 databases have generally similar implementations, but &lt;strong&gt;none of them have hook-based implementations&lt;/strong&gt;.&lt;/p&gt;
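&lt;p&gt;The hook pattern can be modeled in a few lines. The sketch below is a Python model, not PostgreSQL code: a module-level variable stands in for the global function pointer, and every name in it (&lt;code&gt;query_hook&lt;/code&gt;, &lt;code&gt;load_extension&lt;/code&gt;, &lt;code&gt;ext_a&lt;/code&gt;, &lt;code&gt;ext_b&lt;/code&gt;) is hypothetical. It assumes the usual convention that an extension saves the previously installed hook and calls it after doing its own work.&lt;/p&gt;

```python
# Python model of hook chaining; not PostgreSQL code. All names are hypothetical.

query_hook = None  # stands in for a global function pointer, unset by default


def standard_run(query, trace):
    """Built-in behaviour used when no hook is installed."""
    trace.append("core")
    return query.upper()


def run_query(query, trace):
    """Core entry point: call the hook if one is set, else the built-in code."""
    if query_hook is not None:
        return query_hook(query, trace)
    return standard_run(query, trace)


def load_extension(name):
    """Install a hook that records its name, then defers to the previous hook."""
    global query_hook
    previous = query_hook  # save whatever was installed before us

    def hook(query, trace):
        trace.append(name)
        if previous is not None:
            return previous(query, trace)
        return standard_run(query, trace)

    query_hook = hook


def call_order(load_sequence):
    """Load extensions in the given order and report who ran, in order."""
    global query_hook
    query_hook = None
    for name in load_sequence:
        load_extension(name)
    trace = []
    run_query("select 1", trace)
    return trace


print(call_order(["ext_a", "ext_b"]))  # ['ext_b', 'ext_a', 'core']
print(call_order(["ext_b", "ext_a"]))  # ['ext_a', 'ext_b', 'core']
```

&lt;p&gt;In this model the last-loaded extension runs first, so behaviour depends on load order, and a hook that forgets to call the previous one silently disables everything loaded before it.&lt;/p&gt;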
&lt;p&gt;


&lt;img src="https://lastdba.com/img/csdn/a0199291618a.png" alt="image-20251228170307440" /&gt;&lt;/p&gt;
&lt;p&gt;An extension may combine several implementation approaches, e.g., functions + types + an index AM; the number of approaches it uses is its number of extensibility types. Figure 1 shows that extensions with 1-3 types are the most common, and that functions are the most-used approach.&lt;/p&gt;
&lt;p&gt;From Table 3, 92.5% of extensions use UDFs — after all, it&amp;rsquo;s a user-facing feature, easiest to develop with the lowest barrier to entry. The least used is client authentication, as this scenario itself is uncommon.&lt;/p&gt;

&lt;h2 class="relative group"&gt;Extension Code Copy Rate
 &lt;div id="extension-code-copy-rate" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#extension-code-copy-rate" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;
&lt;p&gt;The paper also conducted an interesting survey: the extent to which extension code is copied from built-in code:&lt;/p&gt;
&lt;p&gt;


&lt;img src="https://lastdba.com/img/csdn/69d841de287f.png" alt="image-20260103104107929" /&gt;&lt;/p&gt;
&lt;p&gt;Out of 441 extensions, 16.6% — 73 extensions — contain at least one line copied from PG source code. The detailed distribution is shown in the left chart above.&lt;/p&gt;
&lt;p&gt;Why are so many extensions copying code? Because:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Some functions in PG source are declared static, only callable within their own file, so they can only be copied.&lt;/li&gt;
&lt;li&gt;Due to the extension&amp;rsquo;s own requirements, functions may need slight adjustments, so they can only be copied and adjusted.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And how much were these copied functions adjusted? See the right chart above.&lt;/p&gt;
&lt;p&gt;As can be seen, unmodified copies are actually rare.&lt;/p&gt;
&lt;p&gt;In summary, extension code is copied from PG source out of necessity, and the overall copy rate isn&amp;rsquo;t high.&lt;/p&gt;

&lt;h2 class="relative group"&gt;The Heavyweight! — PG Extension Compatibility
 &lt;div id="the-heavyweight--pg-extension-compatibility" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#the-heavyweight--pg-extension-compatibility" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;
&lt;p&gt;This is the most interesting part of the paper: the authors ran pairwise compatibility tests on 96 extensions and found that 16.8% of extension pairs are incompatible!&lt;/p&gt;
&lt;p&gt;


&lt;img src="https://lastdba.com/img/csdn/6d87c80af09b.png" alt="image-20260103111359805" /&gt;&lt;/p&gt;
&lt;p&gt;Testing methodology:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Installation. Yes, installation alone can cause problems. The authors tested both A→B and B→A installation orders, hence the asymmetric diagram.&lt;/li&gt;
&lt;li&gt;Running the extension&amp;rsquo;s provided unit tests.&lt;/li&gt;
&lt;li&gt;pgbench, as a smoke test. pgbench is a simple workload, but a clean run still shows that basic transactions work with both extensions installed.&lt;/li&gt;
&lt;/ul&gt;
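&lt;p&gt;To get a feel for the size of this test matrix, here is a minimal sketch that enumerates ordered pairs and walks the three stages above. The extension names are placeholders, and &lt;code&gt;stage_passes&lt;/code&gt; is a hypothetical callback; a real harness would drive a PostgreSQL instance at that point. Stopping at the first failing stage is an assumption of this sketch, not something the paper is known to do.&lt;/p&gt;

```python
from itertools import permutations

# Hypothetical stand-ins for the 96 tested extensions.
extensions = [f"ext_{i:02d}" for i in range(96)]

STAGES = ("install", "unit_tests", "pgbench")


def ordered_pairs(names):
    """Every (first, second) installation order; A then B differs from B then A."""
    return list(permutations(names, 2))


def first_failure(pair, stage_passes):
    """Return the first stage that fails for this ordered pair, or None."""
    for stage in STAGES:
        if not stage_passes(pair, stage):
            return stage
    return None


pairs = ordered_pairs(extensions)
print(len(pairs))  # 9120, i.e. 96 * 95
```

&lt;p&gt;Because both orders are tested, the matrix has 96 * 95 cells rather than half that number, which is why it need not be symmetric.&lt;/p&gt;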
&lt;p&gt;Among the top 20 least compatible extensions, many commonly-used ones appear:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Common extensions: pg_hint_plan, vector, pg_show_plans, pgsentinel, pg_cron, pg_stat_kcache&lt;/li&gt;
&lt;li&gt;Heavy extensions: citus, timescaledb&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It is jaw-dropping that such widely used, high-profile extensions have such poor compatibility.&lt;/p&gt;
&lt;p&gt;What&amp;rsquo;s even more chilling: this is just simple pairwise testing. Production deployments likely run 3-10 extensions together, and production environments are far more complex and variable than the paper&amp;rsquo;s three testing methods cover.&lt;/p&gt;
&lt;p&gt;Finally, the paper identifies the reason for poor extension compatibility: extensions that use more components, extension types, and hooks are more likely to be incompatible with other extensions.&lt;/p&gt;

&lt;h2 class="relative group"&gt;Nitpicking
 &lt;div id="nitpicking" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#nitpicking" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;It&amp;rsquo;s really still about Postgres&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The paper&amp;rsquo;s title says DBMS, but it&amp;rsquo;s mainly about PG compatibility. MySQL, Redis, etc. compatibility is only covered in the survey, with no experimental data at all. (Though the survey is interesting — you can learn how MySQL and Redis extensions are implemented.)&lt;/p&gt;
&lt;p&gt;On the other hand, this paper has a kind of alternative &amp;ldquo;general-specific-general&amp;rdquo; feel: &amp;ldquo;DBMS-Postgres-DBMS&amp;rdquo; &amp;#x1f605;&lt;/p&gt;
&lt;ol start="2"&gt;
&lt;li&gt;&lt;strong&gt;Insufficient compatibility testing&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;PG has 400+ extensions, but only 96 were tested for compatibility, and only pairwise, with no tests involving three or more extensions. The compatibility testing isn&amp;rsquo;t particularly comprehensive.&lt;/p&gt;

&lt;h2 class="relative group"&gt;Conclusion
 &lt;div id="conclusion" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#conclusion" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;
&lt;p&gt;PG extensions are indeed numerous and flexible — you&amp;rsquo;d struggle to find functionality that PG extensions &lt;em&gt;don&amp;rsquo;t&lt;/em&gt; support. But the extensions themselves are almost in a state of &amp;ldquo;anarchy&amp;rdquo; — both extension development and usage have problems.&lt;/p&gt;
&lt;p&gt;Judging by the compatibility results, extension compatibility is quite poor: even the installation order affects it. Extensions that install the same hook also depend on hook execution order; for example, two extensions that each require running last cannot both be satisfied. &amp;ldquo;Having everything&amp;rdquo; doesn&amp;rsquo;t mean &amp;ldquo;install everything.&amp;rdquo;&lt;/p&gt;

&lt;h3 class="relative group"&gt;Extension Security Issues
 &lt;div id="extension-security-issues" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#extension-security-issues" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h3&gt;
&lt;p&gt;PG extensions have virtually no security management, whether the risk comes from inherently unsafe extensions or from users escalating privileges through extensions.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;If an extension contains unsafe languages, only the OS can restrict its behavior, not the DBMS.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If an extension can access user space, the OS layer cannot manage it.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Extensions implemented through queries (e.g., UDFs) generally won&amp;rsquo;t bypass ACL policies. While UDFs are more secure, they&amp;rsquo;re not absolutely secure, as UDFs with admin privileges can exist.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A single hook may not be restricted by ACL, because in PostgreSQL, ACL is only enforced at the planning and execution layers. PG provides &lt;code&gt;SECURITY LABEL&lt;/code&gt; to restrict access control for objects (including extensions).&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 class="relative group"&gt;Philosophical Thoughts on Software Management
 &lt;div id="philosophical-thoughts-on-software-management" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#philosophical-thoughts-on-software-management" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h3&gt;
&lt;p&gt;&amp;ldquo;If an extension contains unsafe languages, only the OS can restrict its behavior, not the DBMS.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;This statement itself isn&amp;rsquo;t wrong, but it carries an implication of &amp;ldquo;your directory could be deleted.&amp;rdquo; To counter this, consider the following:&lt;/p&gt;
&lt;p&gt;If you use this software, you trust it, just like PG itself (but even when using PG, you create a postgres OS user rather than using root directly). As for extensions, treat them as part of the PG software. PG is trusted and can be installed directly in production because of its industry reputation. The same goes for extensions — choose reputable extensions rather than using them indiscriminately. This is essentially the difference between PostgreSQL community gatekeeping and extension provider gatekeeping. For cloud service providers, many extensions aren&amp;rsquo;t supported — the cloud provider assumes the gatekeeping function and the responsibility of taking the blame.&lt;/p&gt;

&lt;h3 class="relative group"&gt;Version Convergence
 &lt;div id="version-convergence" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#version-convergence" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h3&gt;
&lt;p&gt;PG extension versions have these characteristics:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The same extension may have different extension packages for different database versions.&lt;/li&gt;
&lt;li&gt;Extensions have different versions.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This means that without version management, you&amp;rsquo;ll end up with an unmanageable number of software versions. A good way to address this is to allow each PG version to install only specific extension versions. When a requirement calls for a newer extension version, deliver it through a PG version upgrade. This strategy sacrifices some flexibility to ensure stability. I personally think it&amp;rsquo;s worthwhile: extensions rarely need upgrading on their own, and the policy avoids many software management issues and unknown compatibility problems.&lt;/p&gt;
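&lt;p&gt;One possible shape for such a policy is a plain allowlist keyed by PG major version. This is only a sketch: the extension names and version numbers below are placeholders rather than recommendations, and &lt;code&gt;is_allowed&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
# Hypothetical allowlist: PG major version mapped to pinned extension versions.
# The names and version numbers are placeholders, not recommendations.
PINNED = {
    16: {"ext_a": "1.2.0", "ext_b": "0.7.0"},
    17: {"ext_a": "1.3.0", "ext_b": "0.8.0"},
}


def is_allowed(pg_major, extension, version):
    """True only if this exact extension version is pinned for the PG major."""
    return PINNED.get(pg_major, {}).get(extension) == version


print(is_allowed(16, "ext_a", "1.2.0"))  # True
print(is_allowed(16, "ext_a", "1.3.0"))  # False: 1.3.0 arrives with PG 17
```

&lt;p&gt;With one pinned version per PG major, the set of combinations to test and support stays small and known in advance.&lt;/p&gt;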

&lt;h3 class="relative group"&gt;Consider Compatibility When Using Extensions
 &lt;div id="consider-compatibility-when-using-extensions" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#consider-compatibility-when-using-extensions" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h3&gt;
&lt;p&gt;Since extension compatibility isn&amp;rsquo;t great, &lt;strong&gt;managing extensions becomes especially important&lt;/strong&gt; — we don&amp;rsquo;t want the database returning strange results or even crashing while running.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Extension management strategy: 1. Install only the extensions you need. 2. Create needed extensions on demand. 3. Don&amp;rsquo;t install obscure extensions.&lt;/li&gt;
&lt;li&gt;Search the compatibility matrix. While the paper&amp;rsquo;s PG compatibility testing isn&amp;rsquo;t perfect, it&amp;rsquo;s still valuable. Since the compatibility matrix in the paper can&amp;rsquo;t be searched directly, you can &amp;ldquo;ctrl+f&amp;rdquo; search the &lt;a href="https://github.com/cmu-db/ext-analyzer/blob/main/plot_scripts/csvs/compatibility_results.csv" target="_blank" rel="noreferrer"&gt;ext-analyzer compatibility table&lt;/a&gt; for a first impression of how compatible the extensions you need are.&lt;/li&gt;
&lt;/ul&gt;
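&lt;p&gt;If you would rather script the lookup than search by hand, a minimal sketch follows. It deliberately assumes nothing about the real file&amp;rsquo;s column layout: it returns every row in which any cell equals the extension name. The sample data and its columns are made up for illustration.&lt;/p&gt;

```python
import csv
import io


def rows_mentioning(csv_text, extension):
    """Return every CSV row in which some cell equals the extension name."""
    reader = csv.reader(io.StringIO(csv_text))
    return [row for row in reader if extension in (cell.strip() for cell in row)]


# Made-up sample in a made-up layout; the real file's columns may differ.
sample = """first,second,result
ext_a,ext_b,ok
ext_b,ext_c,fail
ext_c,ext_a,ok
"""

for row in rows_mentioning(sample, "ext_b"):
    print(row)
```

&lt;p&gt;To use it on the real table, download the CSV and pass its text in place of &lt;code&gt;sample&lt;/code&gt;.&lt;/p&gt;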

&lt;h2 class="relative group"&gt;Trivia
 &lt;div id="trivia" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#trivia" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;
&lt;p&gt;In the 1976 INGRES paper, UDFs were already implemented through extensions. POSTGRES carried this functionality forward in its initial 1986 release. Oracle&amp;rsquo;s UDF implementation came in Oracle 7, released in &lt;a href="https://www.orafaq.com/wiki/Oracle_7" target="_blank" rel="noreferrer"&gt;1992&lt;/a&gt;, much later than PG.&lt;/p&gt;
&lt;p&gt;


&lt;img src="https://lastdba.com/img/csdn/da3915cf7a37.png" alt="image-20251228104850349" /&gt;&lt;/p&gt;
&lt;p&gt;


&lt;img src="https://lastdba.com/img/csdn/2830dc6d8871.png" alt="image-20251228104840046" /&gt;&lt;/p&gt;
&lt;p&gt;The SQL standard didn&amp;rsquo;t include UDFs until 1996 — a full 20 years after INGRES&amp;rsquo;s UDF. Stonebraker indeed wasn&amp;rsquo;t very focused on driving standards.&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;Original link: &lt;a href="https://lastdba.com/2026/01/03/" target="_blank" rel="noreferrer"&gt;https://lastdba.com/2026/01/03/&lt;/a&gt;论文精读插件无政府状态/&lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title>Paper Deep Read: DBAIOps</title><link>https://lastdba.com/en/2025/12/21/paper-deep-read-dbaiops/</link><pubDate>Sun, 21 Dec 2025 00:00:00 +0000</pubDate><guid>https://lastdba.com/en/2025/12/21/paper-deep-read-dbaiops/</guid><description>&lt;p&gt;Paper: &lt;a href="https://www.arxiv.org/pdf/2508.01136" target="_blank" rel="noreferrer"&gt;DBAIOps: A Reasoning LLM-Enhanced Database Operation and Maintenance System using Knowledge Graphs&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Repo: &lt;a href="https://github.com/weAIDB/DBAIOps/" target="_blank" rel="noreferrer"&gt;https://github.com/weAIDB/DBAIOps/&lt;/a&gt;&lt;/p&gt;

&lt;h2 class="relative group"&gt;What is DBAIOps
 &lt;div id="what-is-dbaiops" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#what-is-dbaiops" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Why DBAIOps:&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;Manual operations are extremely time-consuming.&lt;/li&gt;
&lt;li&gt;Manual operations are difficult to scale.&lt;/li&gt;
&lt;li&gt;Manual operations are often trapped in recurring failures.&lt;/li&gt;
&lt;li&gt;Documentation + RAG models are inaccurate (limited DBA experience integration).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In short, both manual operations and existing solutions are mediocre, hence DBAIOps — &lt;strong&gt;an operations system combining LLM reasoning and knowledge graphs to achieve DBA-like diagnostic capabilities&lt;/strong&gt;.&lt;/p&gt;</description><content:encoded>&lt;p&gt;Paper: &lt;a href="https://www.arxiv.org/pdf/2508.01136" target="_blank" rel="noreferrer"&gt;DBAIOps: A Reasoning LLM-Enhanced Database Operation and Maintenance System using Knowledge Graphs&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Repo: &lt;a href="https://github.com/weAIDB/DBAIOps/" target="_blank" rel="noreferrer"&gt;https://github.com/weAIDB/DBAIOps/&lt;/a&gt;&lt;/p&gt;

&lt;h2 class="relative group"&gt;What is DBAIOps
 &lt;div id="what-is-dbaiops" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#what-is-dbaiops" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Why DBAIOps:&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;Manual operations are extremely time-consuming.&lt;/li&gt;
&lt;li&gt;Manual operations are difficult to scale.&lt;/li&gt;
&lt;li&gt;Manual operations are often trapped in recurring failures.&lt;/li&gt;
&lt;li&gt;Documentation + RAG models are inaccurate (limited DBA experience integration).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In short, both manual operations and existing solutions are mediocre, hence DBAIOps — &lt;strong&gt;an operations system combining LLM reasoning and knowledge graphs to achieve DBA-like diagnostic capabilities&lt;/strong&gt;.&lt;/p&gt;
&lt;ol start="2"&gt;
&lt;li&gt;Comparison of database failure analysis approaches:&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;Rule-based approach: Traditional, rigid.&lt;/li&gt;
&lt;li&gt;Machine learning approach: Essentially rule-based with similar limitations; it depends on training data, which limits its ability to generalize; generally suitable for diagnosing common, specific problems.&lt;/li&gt;
&lt;li&gt;LLM-based approach: Uses general documentation and LLMs (e.g., decision-tree-based), prone to giving generic results.&lt;/li&gt;
&lt;li&gt;LLM+RAG approach: Retrieves the top-k approximately matching document chunks; results are often inaccurate.&lt;/li&gt;
&lt;/ul&gt;
&lt;ol start="3"&gt;
&lt;li&gt;After comparing the above approaches, the advantages of &lt;strong&gt;DBAIOps combining graph knowledge, DBA experience, and LLMs are clear:&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;Incorporates DBA experience.&lt;/li&gt;
&lt;li&gt;Preserves original relationships.&lt;/li&gt;
&lt;li&gt;Supports new root cause identification and solutions.&lt;/li&gt;
&lt;li&gt;Extensible.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 class="relative group"&gt;Overview
 &lt;div id="overview" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#overview" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;
&lt;p&gt;


&lt;img src="https://lastdba.com/img/csdn/0901d29f4881.png" alt="image-20251214092938211" /&gt;&lt;/p&gt;
&lt;p&gt;The left side shows the architecture; the right side shows an example.&lt;/p&gt;
&lt;p&gt;Offline: DBA experience is embedded into Neo4j, with the resulting graph model called ExperienceGraph, where edges represent anomaly phenomena or metric relationships. The embedded anomaly model is called AnomalyModel.&lt;/p&gt;
&lt;p&gt;Online: Anomaly analysis, retrieval, and report generation. The AnomalyProcessor extracts standard failure information and AnomalyModel information, then retrieves the graph via ExperienceRetriever; finally, RootCauseAnalyzer calls the LLM to generate analysis reports.&lt;/p&gt;
&lt;p&gt;In the right-side example, graph relevance links LOG FILE SYNC to LOG WRITE performance and I/O performance, while REDO ALLOCATION leads to table structure changes and DDL.&lt;/p&gt;

&lt;h2 class="relative group"&gt;The Operations Experience Graph Model
 &lt;div id="the-operations-experience-graph-model" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#the-operations-experience-graph-model" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;
&lt;p&gt;Unlike rule-based or document-chunk-based RAG, ExperienceGraph is a graph model encoding heterogeneous operations experience information. The graph contains three elements: (vertices, directed edges, relationships on edges).&lt;/p&gt;
&lt;p&gt;Based on the characteristics of operations experience, DBAIOps classifies vertices:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;trigger vertex: Used to detect database anomalies; the entry point for anomaly analysis. For example, LOG FILE SYNC is an entry vertex.&lt;/li&gt;
&lt;li&gt;metric vertex: Database runtime metrics. For offline knowledge, this refers to metrics from operations case studies (if present).&lt;/li&gt;
&lt;li&gt;experience vertex: Encodes domain-specific operations experience, covering what an anomaly means and how to handle it. For example, LOG FILE SYNC exceeding 60ms indicates that commits are too frequent or that parameters need adjusting.&lt;/li&gt;
&lt;li&gt;tool vertex: Executable scripts for collecting and analyzing anomaly metrics.&lt;/li&gt;
&lt;li&gt;tag vertex: Semantic categories of graph vertices. For example, &amp;ldquo;Concurrent Transactions&amp;rdquo; involves multiple vertex types; tag vertices strengthen cross-case associations.&lt;/li&gt;
&lt;li&gt;auxiliary vertex: Explains the meaning of metrics.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Edge classification:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;containment edge: Trigger Vertex - Experience Vertex&lt;/li&gt;
&lt;li&gt;relevance edge: Trigger Vertex - Metric Vertex&lt;/li&gt;
&lt;li&gt;diagnosis edge: Experience Vertex - Metric Vertex&lt;/li&gt;
&lt;li&gt;synonym edge: Only appears between Tag Vertices, indicating semantic synonymy, e.g., physical_read and disk_read; shared_pool and shared_buffer.&lt;/li&gt;
&lt;/ul&gt;
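&lt;p&gt;The vertex and edge taxonomy above maps naturally onto a small typed data structure. The sketch below is an illustration in plain Python, not the DBAIOps implementation (which stores the graph in Neo4j). The class names, the &lt;code&gt;EDGE_RULES&lt;/code&gt; table, and the edge direction (first vertex type to second) are assumptions.&lt;/p&gt;

```python
from dataclasses import dataclass
from enum import Enum


class VertexKind(Enum):
    TRIGGER = "trigger"
    METRIC = "metric"
    EXPERIENCE = "experience"
    TOOL = "tool"
    TAG = "tag"
    AUXILIARY = "auxiliary"


# Allowed (source kind, target kind) per edge type; the direction is assumed.
EDGE_RULES = {
    "containment": (VertexKind.TRIGGER, VertexKind.EXPERIENCE),
    "relevance": (VertexKind.TRIGGER, VertexKind.METRIC),
    "diagnosis": (VertexKind.EXPERIENCE, VertexKind.METRIC),
    "synonym": (VertexKind.TAG, VertexKind.TAG),
}


@dataclass(frozen=True)
class Vertex:
    name: str
    kind: VertexKind


@dataclass(frozen=True)
class Edge:
    kind: str
    source: Vertex
    target: Vertex

    def __post_init__(self):
        # Reject edges whose endpoint kinds do not match the edge type.
        expected = EDGE_RULES[self.kind]
        if (self.source.kind, self.target.kind) != expected:
            raise ValueError(
                f"{self.kind} edge cannot link "
                f"{self.source.kind.value} to {self.target.kind.value}"
            )


trigger = Vertex("LOG FILE SYNC", VertexKind.TRIGGER)
advice = Vertex("commits too frequent", VertexKind.EXPERIENCE)
edge = Edge("containment", trigger, advice)
print(edge.kind, edge.source.name, edge.target.name)
```

&lt;p&gt;Validating endpoint kinds at construction time keeps malformed relationships, such as a synonym edge between a trigger and an experience vertex, out of the graph.&lt;/p&gt;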
&lt;p&gt;Analyzing the operations experience graph model through an example:&lt;/p&gt;
&lt;p&gt;


&lt;img src="https://lastdba.com/img/csdn/af2763d6d88b.png" alt="image-20251215210049114" /&gt;&lt;/p&gt;
&lt;p&gt;LOG FILE SYNC has multiple TAGs, and TAGs are associated with Experience, metrics, and tools. The strong relevance is evident — it represents a human DBA&amp;rsquo;s understanding and operations experience of LOG FILE SYNC.&lt;/p&gt;

&lt;h2 class="relative group"&gt;Graph Construction
 &lt;div id="graph-construction" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#graph-construction" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;
&lt;p&gt;Manual graph construction is unreliable, and existing ML-based graph generation may produce irrelevant relationships, so the paper proposes a semi-automatic approach.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Graph initialization: This part is manual: trigger vertices are defined according to rules. Once the trigger vertices exist, their associated metric vertices, experience vertices, etc., are generated automatically. This is somewhat like a human DBA sketching the outline of the knowledge: the overall framework stays fixed, so nothing bizarre gets generated.&lt;/li&gt;
&lt;li&gt;Graph storage: Stored in Neo4j. Additionally, different database types are marked with tags, making much knowledge reusable and avoiding duplicate graph construction.&lt;/li&gt;
&lt;li&gt;Graph augmentation: Generating more edges.&lt;/li&gt;
&lt;li&gt;Graph updates: DBAIOps supports incremental updates. Updates here include both adding new vertices and removing old vertices.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 class="relative group"&gt;Anomaly Model
 &lt;div id="anomaly-model" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#anomaly-model" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;

&lt;h3 class="relative group"&gt;Metrics
 &lt;div id="metrics" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#metrics" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h3&gt;
&lt;p&gt;Metrics come from many sources, including runtime information (CPU %, throughput, and other routine monitoring), logs, traces, etc. Because metrics differ in relevance, the strongly correlated ones need to be extracted. So metrics are divided into 2 categories:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Immediately collected metrics: Runtime information, logs, traces.&lt;/li&gt;
&lt;li&gt;Subsequently collected metrics: Periodic or delta metrics generated on demand, such as AWR/ASH data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Regarding metric-anomaly correlation, unlike baseline-based approaches, DBAIOps uses specific metric combinations for each anomaly type.&lt;/p&gt;
&lt;p&gt;Finally, a formula determines whether an anomaly has actually occurred:&lt;/p&gt;
&lt;p&gt;


&lt;img src="https://lastdba.com/img/csdn/3693e91d7723.png" alt="image-20251214093339574" /&gt;&lt;/p&gt;

&lt;h3 class="relative group"&gt;Two-Stage Graph Evolution
 &lt;div id="two-stage-graph-evolution" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#two-stage-graph-evolution" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h3&gt;
&lt;p&gt;Database anomalies rarely occur in isolation — one performance issue may simultaneously trigger or exacerbate others. However, connections between different anomaly models (e.g., LOG_FILE_SYNC and REDO_ALLOCATION) in pre-built knowledge graphs tend to be loose, with shared experience fragments sparse and fragmented. This makes it difficult for traditional methods to discover cross-model composite root causes, such as combined I/O bottleneck and memory pressure issues.&lt;/p&gt;
&lt;p&gt;To address this challenge, DBAIOps proposes an automatic &amp;ldquo;graph evolution&amp;rdquo; mechanism that dynamically discovers and connects relevant experience fragments between different anomaly models, evolving the knowledge graph from an initially sparse structure into a densely interconnected network, thus supporting more comprehensive root cause analysis.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Stage 1 - Graph Inference and Proximity Discovery: Uses a graph query language (Cypher) to collect and aggregate relevant metrics, traversing related nodes and edges based on configurable thresholds to build association networks. For example, starting from LOG_FILE_SYNC latency, it traverses associated nodes up to 3 hops away. This connects the LOG_FILE_SYNC and REDO_ALLOCATION models, because both relate to I/O concurrency issues. Through multiple iterations, the knowledge graph gradually evolves into a denser structure, enabling diagnosis to consider more potential factors and composite causes.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Stage 2 - Adaptive Abnormal Metric Detection: Identifies truly anomalous metrics along graph expansion paths. Using an Adaptive Detection Function (ADF), it calculates composite anomaly scores considering dimensions such as metric volatility and dynamic baseline deviation. Based on anomaly scoring results, it decides whether further knowledge graph structure expansion is needed, filtering a precise subset of anomaly metrics for subsequent LLM root cause reasoning.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
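&lt;p&gt;A minimal Python sketch of the two stages, under clearly stated assumptions: the graph is a toy in-memory adjacency dict rather than Neo4j queried through Cypher, the vertex names other than LOG_FILE_SYNC and REDO_ALLOCATION are made up, and &lt;code&gt;anomaly_score&lt;/code&gt; is only an illustrative stand-in for the paper&amp;rsquo;s ADF (its weights and formula are hypothetical).&lt;/p&gt;

```python
from collections import deque
from statistics import mean, pstdev

# Hypothetical toy graph: vertex -> neighbouring vertices.
GRAPH = {
    "LOG_FILE_SYNC": ["redo_write_latency", "io_wait"],
    "redo_write_latency": ["REDO_ALLOCATION"],
    "io_wait": ["disk_queue_depth"],
    "REDO_ALLOCATION": ["latch_contention"],
    "disk_queue_depth": [],
    "latch_contention": ["cpu_usage"],
    "cpu_usage": [],
}


def neighbours_within(graph, start, max_hops=3):
    """Stage 1 stand-in: breadth-first traversal limited to max_hops."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if hops[vertex] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in graph.get(vertex, []):
            if nxt not in hops:
                hops[nxt] = hops[vertex] + 1
                queue.append(nxt)
    hops.pop(start)
    return hops  # vertex -> distance from the trigger vertex


def anomaly_score(history, current, w_dev=0.7, w_vol=0.3):
    """Stage 2 stand-in: blend baseline deviation with volatility."""
    baseline = mean(history)
    std = pstdev(history)
    deviation = abs(current - baseline) / (std or 1.0)
    volatility = std / (abs(baseline) or 1.0)
    return w_dev * deviation + w_vol * volatility


reachable = neighbours_within(GRAPH, "LOG_FILE_SYNC", max_hops=3)
print(reachable)
print(anomaly_score([10, 12, 9, 11], 30))
```

&lt;p&gt;In this toy graph, REDO_ALLOCATION is reachable from LOG_FILE_SYNC within the hop limit, which is the kind of cross-model link Stage 1 is meant to surface. A score like the one above would then decide which metrics along the path are worth passing to the LLM.&lt;/p&gt;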
&lt;p&gt;


&lt;img src="https://lastdba.com/img/csdn/313bde49387f.png" alt="image-20251214103841593" /&gt;&lt;/p&gt;

&lt;h2 class="relative group"&gt;Generating Analysis Reports
 &lt;div id="generating-analysis-reports" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#generating-analysis-reports" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;
&lt;p&gt;Once the graph is ready, prompts need to be fed to the LLM to generate desired reports. A well-structured prompt can also improve report accuracy.&lt;/p&gt;
&lt;p&gt;Anomalies have 5 components, which serve as the prompt for the LLM:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Anomaly: Anomaly description (&amp;ldquo;CPU usage spiked to 95% at 16:00 on 2023-10-05&amp;rdquo;)&lt;/li&gt;
&lt;li&gt;Condition: Anomaly trigger condition (&amp;ldquo;exceeds 90% for &amp;gt;5 min&amp;rdquo;)&lt;/li&gt;
&lt;li&gt;Metrics&lt;/li&gt;
&lt;li&gt;Experience: Provides normal load values or recent maintenance tasks.&lt;/li&gt;
&lt;li&gt;Output: Describes the report&amp;rsquo;s composition — anomaly verification (requiring further analysis), root cause analysis, recovery plan, summary, SQL text.&lt;/li&gt;
&lt;/ul&gt;
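&lt;p&gt;To make the five components concrete, here is a small Python sketch that assembles them into a prompt. The class name, field names, and the text layout are my own assumptions for illustration, not the paper&amp;rsquo;s actual template. Only the five component names and the list of report sections come from the description above.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class AnomalyPrompt:
    anomaly: str       # anomaly description
    condition: str     # trigger condition
    metrics: dict      # metric name -> observed value
    experience: list   # normal load values, recent maintenance tasks, ...
    output_sections: list = field(default_factory=lambda: [
        "anomaly verification", "root cause analysis",
        "recovery plan", "summary", "SQL text",
    ])

    def render(self):
        """Lay the five components out as plain text (hypothetical layout)."""
        metric_lines = [f"- {name}: {value}" for name, value in self.metrics.items()]
        experience_lines = [f"- {item}" for item in self.experience]
        return "\n".join([
            f"Anomaly: {self.anomaly}",
            f"Condition: {self.condition}",
            "Metrics:", *metric_lines,
            "Experience:", *experience_lines,
            "Output: " + ", ".join(self.output_sections),
        ])


prompt = AnomalyPrompt(
    anomaly="CPU usage spiked to 95% at 16:00 on 2023-10-05",
    condition="exceeds 90% for >5 min",
    metrics={"cpu_pct": 95},                 # hypothetical metric
    experience=["nightly backup at 15:30"],  # hypothetical maintenance task
)
print(prompt.render())
```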
&lt;p&gt;&lt;em&gt;Some personal thoughts&lt;/em&gt;:&lt;/p&gt;
&lt;p&gt;Recent maintenance tasks are very useful: they generally correlate strongly with failures, and failure analysis can&amp;rsquo;t be purely technical analysis. However, who keeps these maintenance tasks up to date, and which ones get recorded, remains an open problem.&lt;/p&gt;
&lt;p&gt;The first few items in the output are easy to understand, but the last one, SQL text, is a stroke of genius. In production environments, aside from hardware failures, database runtime status is strongly correlated with SQL. I personally believe you can capture SQL by default and discuss causality later. From an operations perspective, failures always require joint investigation with developers, so capturing SQL text is basically mandatory.&lt;/p&gt;

&lt;h2 class="relative group"&gt;Evaluation
 &lt;div id="evaluation" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#evaluation" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;
&lt;p&gt;Comparison of analysis report quality across different tools and approaches:&lt;/p&gt;
&lt;p&gt;


&lt;img src="https://lastdba.com/img/csdn/2f2ecc9b755b.png" alt="image-20251215082259815" /&gt;&lt;/p&gt;
&lt;p&gt;Impressive results. Notably, DBAIOps specifically emphasizes that mid-sized LLMs already produce good analysis results. This is important — DeepSeek-R1 671B running bare isn&amp;rsquo;t bad, but the cost is on a completely different level.&lt;/p&gt;

&lt;h2 class="relative group"&gt;Nitpicking
 &lt;div id="nitpicking" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#nitpicking" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Can&amp;rsquo;t really be called &amp;ldquo;Ops&amp;rdquo; — it only has failure analysis functionality. Ops content is vast; failure analysis is just the tip of the iceberg.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Graph classification doesn&amp;rsquo;t match the graph example. The defined tag vertices and edges differ significantly from the example.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The vertices in the example play important roles, but these edge types aren&amp;rsquo;t defined: tag vertex-tool vertex, tag vertex-experience vertex, tag vertex-metric vertex. And the edges that should exist seem mostly absent, with only synonym edges present.&lt;/p&gt;
&lt;p&gt;Undescribed parts of the example should be listed, otherwise it&amp;rsquo;s confusing.&lt;/p&gt;
&lt;ol start="3"&gt;
&lt;li&gt;The two-stage graph evolution results are a bit odd:&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;


&lt;img src="https://lastdba.com/img/csdn/5fd6929e7dee.png" alt="image-20251214165952773" /&gt;&lt;/p&gt;
&lt;p&gt;w/o ADF means without Stage 2 graph evolution (adaptive abnormal metric detection).
w/o ADF should mean without Stage 1 graph evolution (graph inference and proximity discovery).
w/o ADF means without either stage of graph evolution.&lt;/p&gt;
&lt;p&gt;Here, the case with both stages of graph evolution is missing — having it would better demonstrate the effectiveness of two-stage graph evolution.&lt;/p&gt;
&lt;ol start="4"&gt;
&lt;li&gt;Root causes are somewhat limited:&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;


&lt;img src="https://lastdba.com/img/csdn/4e7faebf0b1a.png" alt="image-20251214114018609" /&gt;&lt;/p&gt;
&lt;p&gt;The circled ones should be relatively common (I only looked at Oracle and Postgres), but these root causes are currently absent.&lt;/p&gt;
&lt;p&gt;PG&amp;rsquo;s root causes are a bit sparse. Dirty page flushing generally isn&amp;rsquo;t a major issue — as a root cause, it probably ranks behind many other root causes.&lt;/p&gt;

&lt;h2 class="relative group"&gt;Summary
 &lt;div id="summary" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#summary" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;
&lt;p&gt;Points I personally really like:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;GraphRAG should be better than vector RAG for failure diagnosis.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;


&lt;img src="https://lastdba.com/img/csdn/eddb311c9614.png" alt="image-20251215212534234" /&gt;&lt;/p&gt;
&lt;p&gt;(GraphRAG original paper: &lt;a href="https://arxiv.org/pdf/2404.16130" target="_blank" rel="noreferrer"&gt;From Local to Global: A GraphRAG Approach to Query-Focused Summarization&lt;/a&gt;)&lt;/p&gt;
&lt;p&gt;SS represents vector RAG, TS represents source text summaries, and C0/C1/C2/C3 represent GraphRAG at different knowledge granularities. From this chart, we can simply conclude: GraphRAG is better suited for multi-document complex scenarios and multi-angle analysis, but may not necessarily outperform vector RAG in precision.&lt;/p&gt;
&lt;ol start="2"&gt;
&lt;li&gt;Semi-automatic graph generation approach.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Graph generation is semi-automatic: trigger vertices are created manually, and the rest can be auto-generated. For example, LOG FILE SYNC is a trigger vertex. Obvious anomaly points make natural entry points for failure analysis. The same holds for PG and for any other failure, and it matches how humans reason about failures.&lt;/p&gt;
&lt;ol start="3"&gt;
&lt;li&gt;Automatic graph evolution.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Strengthening associations between certain vertices is meaningful, as evident from the &amp;ldquo;Performance of DBAIOps Variants&amp;rdquo; table.&lt;/p&gt;
&lt;ol start="4"&gt;
&lt;li&gt;Automatic baseline adjustment.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In &lt;em&gt;Observability Engineering&lt;/em&gt;, there&amp;rsquo;s this passage about AIOps:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;AI can only help when there are clearly discernible patterns and it can identify shifting baselines for prediction — such AIOps doesn&amp;rsquo;t exist yet.&lt;/p&gt;
&lt;/blockquote&gt;&lt;p&gt;DBAIOps in my eyes:&lt;/p&gt;
&lt;p&gt;Clearly discernible patterns = DBAIOps&amp;rsquo;s graph, which includes failure models, anomaly relationships, monitoring data, and logs.&lt;/p&gt;
&lt;p&gt;Shifting baselines = DBAIOps&amp;rsquo;s adaptive abnormal metric detection.&lt;/p&gt;
&lt;p&gt;In summary, it&amp;rsquo;s a significant advancement over random chunking of failure knowledge, setting a single baseline, and vector approximate search in RAG models.&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;Original link: &lt;a href="https://lastdba.com/2025/12/21/" target="_blank" rel="noreferrer"&gt;https://lastdba.com/2025/12/21/&lt;/a&gt;论文精读dbaio-ps/&lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title>CXL and PolarDB-CXL</title><link>https://lastdba.com/en/2025/11/30/cxl-and-polardb-cxl/</link><pubDate>Sun, 30 Nov 2025 00:00:00 +0000</pubDate><guid>https://lastdba.com/en/2025/11/30/cxl-and-polardb-cxl/</guid><description>&lt;p&gt;Paper: Unlocking the Potential of CXL for Disaggregated Memory in Cloud-Native Databases&lt;/p&gt;
&lt;p&gt;SIGMOD best paper: &lt;a href="https://sigmod.org/sigmod-awards/sigmod-best-paper-award/" target="_blank" rel="noreferrer"&gt;https://sigmod.org/sigmod-awards/sigmod-best-paper-award/&lt;/a&gt;&lt;/p&gt;

&lt;h2 class="relative group"&gt;CXL and PolarDB-CXL
 &lt;div id="cxl-and-polardb-cxl" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#cxl-and-polardb-cxl" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;

&lt;h3 class="relative group"&gt;What is CXL
 &lt;div id="what-is-cxl" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#what-is-cxl" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;CXL&lt;/strong&gt;: An open industry standard, a high-speed interconnect specification formulated by the CXL Consortium (founded in 2019 by tech giants Intel, AMD, ARM, etc.). It represents the evolutionary direction of computing architecture. Currently at CXL 4.0.&lt;/p&gt;</description><content:encoded>&lt;p&gt;Paper: Unlocking the Potential of CXL for Disaggregated Memory in Cloud-Native Databases&lt;/p&gt;
&lt;p&gt;SIGMOD best paper: &lt;a href="https://sigmod.org/sigmod-awards/sigmod-best-paper-award/" target="_blank" rel="noreferrer"&gt;https://sigmod.org/sigmod-awards/sigmod-best-paper-award/&lt;/a&gt;&lt;/p&gt;

&lt;h2 class="relative group"&gt;CXL and PolarDB-CXL
 &lt;div id="cxl-and-polardb-cxl" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#cxl-and-polardb-cxl" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;

&lt;h3 class="relative group"&gt;What is CXL
 &lt;div id="what-is-cxl" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#what-is-cxl" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;CXL&lt;/strong&gt;: An open industry standard, a high-speed interconnect specification formulated by the CXL Consortium (founded in 2019 by tech giants Intel, AMD, ARM, etc.). It represents the evolutionary direction of computing architecture. Currently at CXL 4.0.&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Feature&lt;/th&gt;
 &lt;th&gt;CXL 1.0/1.1&lt;/th&gt;
 &lt;th&gt;CXL 2.0&lt;/th&gt;
 &lt;th&gt;CXL 3.0/3.1&lt;/th&gt;
 &lt;th&gt;CXL 4.0 (latest)&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;Release&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;March / June 2019&lt;/td&gt;
 &lt;td&gt;November 2020&lt;/td&gt;
 &lt;td&gt;August 2022 / November 2023&lt;/td&gt;
 &lt;td&gt;November 2025&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;Base Protocol&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;PCIe 5.0 (32 GT/s)&lt;/td&gt;
 &lt;td&gt;PCIe 5.0 (32 GT/s)&lt;/td&gt;
 &lt;td&gt;PCIe 6.0 (64 GT/s)&lt;/td&gt;
 &lt;td&gt;PCIe 7.0 (128 GT/s)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;Max Bandwidth (x16 link, raw, per direction)&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;~64 GB/s&lt;/td&gt;
 &lt;td&gt;~64 GB/s&lt;/td&gt;
 &lt;td&gt;~128 GB/s&lt;/td&gt;
 &lt;td&gt;~256 GB/s&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;Topology Scale&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;Point-to-point / simple star&lt;/td&gt;
 &lt;td&gt;Single switch (≤32 nodes)&lt;/td&gt;
 &lt;td&gt;Multi-level Fabric (4096 nodes)&lt;/td&gt;
 &lt;td&gt;Ultra-large-scale Fabric&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;From my research, two descriptions of CXL left the deepest impression:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Memory as a Service&lt;/li&gt;
&lt;li&gt;Near-memory computing and expansion&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;CXL switch&lt;/strong&gt;: A switching chip, physical hardware. Many vendors are working on industrial implementations. The paper specifically references products from XConn Tech: &lt;a href="https://www.xconn-tech.com/products" target="_blank" rel="noreferrer"&gt;CXL 2.0 switch&lt;/a&gt;. Note that as of November 22, 2025, XConn only has CXL 2.0 switches, no 3.0 products. However, there are products on the market supporting 3.0+ standards, such as &lt;a href="https://panmnesia.com/news/en/2025-11-13-switch-sample/" target="_blank" rel="noreferrer"&gt;Panmnesia CXL 3.2 Fabric Switch&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;PolarCXLMem&lt;/strong&gt;: According to the paper, &amp;ldquo;the first CXL-switch-based disaggregated memory system.&amp;rdquo; But the paper also states &amp;ldquo;we leverage the world&amp;rsquo;s first CXL switch[50]&amp;rdquo; — specifically referring to the XConn tech CXL 2.0 switch — and then says &amp;ldquo;PolarCXLMem is the first CXL-switch-based disaggregated memory.&amp;rdquo; This can be interpreted in two ways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The first disaggregated memory system based on CXL switches&lt;/li&gt;
&lt;li&gt;The first disaggregated memory system based on XConn tech CXL 2.0 switches&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;PolarDB-CXL&lt;/strong&gt;: The paper doesn&amp;rsquo;t actually use this term, but the industry does. It stands for &amp;ldquo;integrate &lt;em&gt;PolarCXLMem&lt;/em&gt; into the multi-primary version of PolarDB, known as PolarDB-MP&amp;rdquo;, which is essentially &amp;ldquo;&lt;strong&gt;the CXL-upgraded version of PolarDB-MP&lt;/strong&gt;.&amp;rdquo; The paper repeatedly uses lengthy phrases but never the term PolarDB-CXL. For convenience, this article uses PolarDB-CXL with that meaning.&lt;/p&gt;

&lt;h3 class="relative group"&gt;RDMA vs CXL
 &lt;div id="rdma-vs-cxl" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#rdma-vs-cxl" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h3&gt;
&lt;p&gt;PolarDB-MP uses RDMA architecture, while PolarDB-CXL uses CXL architecture:&lt;/p&gt;
&lt;p&gt;


&lt;img src="https://lastdba.com/img/csdn/fc236ee67755.png" alt="image-20251122115316339" /&gt;&lt;/p&gt;
&lt;p&gt;(&lt;a href="https://medium.com/@anan.mirji/cxl-switch-vs-rdma-a-technical-comparison-for-high-performance-interconnects-6aaa031cde31" target="_blank" rel="noreferrer"&gt;https://medium.com/@anan.mirji/cxl-switch-vs-rdma-a-technical-comparison-for-high-performance-interconnects-6aaa031cde31&lt;/a&gt;)&lt;/p&gt;
&lt;p&gt;RDMA architecture is a cross-host distributed interconnect architecture, while CXL architecture is a single-host expanded interconnect architecture.&lt;/p&gt;
&lt;p&gt;Key differences:&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Dimension&lt;/th&gt;
 &lt;th&gt;RDMA Architecture&lt;/th&gt;
 &lt;th&gt;CXL Architecture&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Topology&lt;/td&gt;
 &lt;td&gt;Multi-host + network switch distributed arch&lt;/td&gt;
 &lt;td&gt;Single-host + CXL switch expanded arch&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Communication&lt;/td&gt;
 &lt;td&gt;Network (InfiniBand/RoCE)&lt;/td&gt;
 &lt;td&gt;PCIe bus (CXL based on PCIe physical layer)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Core Components&lt;/td&gt;
 &lt;td&gt;RDMA NIC (dedicated NIC)&lt;/td&gt;
 &lt;td&gt;CXL Controller, CXL Switch&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Resource Ownership&lt;/td&gt;
 &lt;td&gt;&amp;ldquo;Remote resources&amp;rdquo; across independent hosts&lt;/td&gt;
 &lt;td&gt;&amp;ldquo;Expanded resources&amp;rdquo; within the host architecture&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;

&lt;h3 class="relative group"&gt;CXL&amp;rsquo;s Advantages
 &lt;div id="cxls-advantages" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#cxls-advantages" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;CXL&amp;rsquo;s advantages over RDMA:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Low latency: CXL connects to host or device memory via PCIe; RDMA requires protocol interface conversion between InfiniBand and PCIe.&lt;/p&gt;
&lt;p&gt;Instruction support: CXL provides native load/store instructions, allowing the CPU to directly manipulate remote CXL device memory as if it were local memory. RDMA requires reading from remote memory to local memory, processing locally, then writing back to remote memory.&lt;/p&gt;
&lt;p&gt;Simplified applications: RDMA requires special interfaces and drivers, needing professionals to design complex programs; CXL provides transparent memory space, greatly simplifying application design.&lt;/p&gt;
&lt;p&gt;Memory fusion: CXL 3.0 supports hardware-level memory sharing across hosts, with coherency handled by the hardware.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Problems with PolarDB-MP and the value CXL provides:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The paper&amp;rsquo;s critique of the RDMA-based PolarDB-MP:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Memory pages are 4-16 KB, so even when only a small amount of data is needed, whole pages must move between local and shared memory, causing read/write amplification.&lt;/li&gt;
&lt;li&gt;Maintaining local memory adds extra memory overhead, reducing throughput.&lt;/li&gt;
&lt;li&gt;Recovery is very time-consuming.&lt;/li&gt;
&lt;li&gt;RDMA is far better than TCP/IP, but under high concurrency, it suffers from &amp;ldquo;doorbell register implicit contention&amp;rdquo; and &amp;ldquo;cache thrashing&amp;rdquo; issues.&lt;/li&gt;
&lt;li&gt;The database itself must maintain shared memory.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Benefits CXL brings:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Eliminates the &amp;ldquo;shared memory - local memory&amp;rdquo; hierarchy, along with its maintenance overhead and read/write amplification. Because CXL load/store is close enough to local memory speed, all buffer pages can be stored directly in CXL memory.&lt;/li&gt;
&lt;li&gt;Uses cache lines (64B) as the minimum transfer unit between CPU cache and main memory, rather than PolarDB-MP&amp;rsquo;s 4K pages.&lt;/li&gt;
&lt;li&gt;Saves main memory. DRAM costs are very high, roughly 40-50% of server/rack costs.&lt;/li&gt;
&lt;li&gt;Simplifies system design. Minimal modifications to existing systems are important for commercial database stability.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;PolarRecv&lt;/em&gt;: An instant recovery system built on CXL. After a database crash, data and metadata remain on CXL, allowing direct reads of consistent state from CXL memory, so recovery is very fast. (This seems similar to how PG&amp;rsquo;s page cache helps fast startup after a crash.)&lt;/li&gt;
&lt;/ul&gt;
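&lt;p&gt;The amplification argument can be checked with simple arithmetic. The sketch below compares how many bytes move when access happens in whole pages versus whole cache lines. The 200-byte row is a hypothetical example, and the calculation assumes the row starts on a unit boundary, so treat the numbers as a rough illustration rather than a measurement.&lt;/p&gt;

```python
import math

CACHE_LINE = 64             # bytes, the CXL transfer unit named in the text
PAGE_SIZES = (4096, 16384)  # bytes, the 4-16 KB page range named in the text


def bytes_moved(payload, unit):
    """Bytes transferred when data can only move in whole units."""
    return math.ceil(payload / unit) * unit


def amplification(payload, unit):
    """Ratio of bytes moved to bytes actually needed."""
    return bytes_moved(payload, unit) / payload


row = 200  # hypothetical row size in bytes
for page in PAGE_SIZES:
    print("page", page, "amplification", round(amplification(row, page), 2))
print("cache line", CACHE_LINE, "amplification", round(amplification(row, CACHE_LINE), 2))
```

&lt;p&gt;Under these assumptions, fetching the row in cache lines moves only a few hundred bytes, while page-granular transfer moves the whole 4 KB or 16 KB page regardless of how little of it is needed.&lt;/p&gt;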
&lt;p&gt;&lt;strong&gt;DRAM vs RDMA vs CXL&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;


&lt;img src="https://lastdba.com/img/csdn/66af810eb94e.png" alt="image-20251122155133782" /&gt;&lt;/p&gt;
&lt;p&gt;


&lt;img src="https://lastdba.com/img/csdn/ae9cc24e5158.png" alt="image-20251122155109014" /&gt;&lt;/p&gt;
&lt;p&gt;When data volume is small, RDMA has significantly higher latency than CXL; with larger data, RDMA&amp;rsquo;s latency improves slightly. Local DRAM access is slightly better than CXL access.&lt;/p&gt;
&lt;p&gt;Overall, CXL memory access latency is slightly higher than DRAM but better than RDMA.&lt;/p&gt;
&lt;p&gt;Regarding CXL&amp;rsquo;s higher latency vs DRAM, the paper explains: &amp;ldquo;database buffer pool operations are more sensitive to bandwidth than latency&amp;rdquo; — for database memory, bandwidth matters more than latency.&lt;/p&gt;

&lt;h2 class="relative group"&gt;Custom Rack
 &lt;div id="custom-rack" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#custom-rack" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;
&lt;p&gt;The authors built their own physical prototype racks. The left rack integrates two CXL switch-enabled clusters, each connected to memory devices and hosts; the right rack integrates one CXL switch connected to memory devices and hosts.&lt;/p&gt;
&lt;p&gt;


&lt;img src="https://lastdba.com/img/csdn/45333b6bf088.png" alt="image-20251122151718276" /&gt;&lt;/p&gt;

&lt;h2 class="relative group"&gt;PolarCXLMem
 &lt;div id="polarcxlmem" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#polarcxlmem" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;
&lt;p&gt;The CXL 2.0 switch supports memory pooling, but the drivers don&amp;rsquo;t fully support it, so PolarCXLMem implements its own CXL memory allocation and management, which means it is not fully transparent. PolarCXLMem organizes CXL memory in a multi-tenant model, with each host node allocated its own CXL memory region.&lt;/p&gt;
&lt;p&gt;


&lt;img src="https://lastdba.com/img/csdn/ddc3509d74d0.png" alt="image-20251123094443287" /&gt;&lt;/p&gt;
&lt;p&gt;PolarCXLMem characteristics:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Nodes have their own CXL memory regions; different nodes&amp;rsquo; CXL memory does not overlap.&lt;/li&gt;
&lt;li&gt;The buffer pool is allocated at database startup (by the CXL mem manager in the diagram) and does not change during runtime.&lt;/li&gt;
&lt;li&gt;The memory unit structure in CXL mem is a block, which stores page data and page metadata, including: id (page id), lock state (whether the page is locked for update), prev/next (LRU doubly-linked list), lsn (latest log sequence number of the page).&lt;/li&gt;
&lt;li&gt;Free list / in-use list is used for LRU.&lt;/li&gt;
&lt;/ul&gt;
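&lt;p&gt;A rough Python sketch of the block layout and the free/in-use lists described above. This is only a toy model: the names are hypothetical, an &lt;code&gt;OrderedDict&lt;/code&gt; stands in for the prev/next doubly-linked LRU list, and eviction ignores lock state and dirty pages, which a real buffer pool would have to respect.&lt;/p&gt;

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class Block:
    page_id: int = -1           # id: which page this block currently holds
    write_locked: bool = False  # lock state: page locked for update
    lsn: int = 0                # latest log sequence number of the page
    data: bytes = b""           # page contents


class ToyBufferPool:
    """Fixed set of blocks allocated once at startup (toy model)."""

    def __init__(self, n_blocks):
        self.blocks = [Block() for _ in range(n_blocks)]
        self.free_list = list(range(n_blocks))
        self.in_use = OrderedDict()  # page_id -> block index, oldest first

    def get(self, page_id, load_page):
        if page_id in self.in_use:
            self.in_use.move_to_end(page_id)  # mark as most recently used
            return self.blocks[self.in_use[page_id]]
        if self.free_list:
            idx = self.free_list.pop()
        else:
            _, idx = self.in_use.popitem(last=False)  # evict the LRU page
        data, lsn = load_page(page_id)
        self.blocks[idx] = Block(page_id, False, lsn, data)
        self.in_use[page_id] = idx
        return self.blocks[idx]


def load_from_storage(page_id):
    """Hypothetical loader returning (page bytes, page LSN)."""
    return b"page-%d" % page_id, page_id * 10


pool = ToyBufferPool(2)
for pid in (1, 2, 1, 3):
    pool.get(pid, load_from_storage)
print(list(pool.in_use))
```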
&lt;p&gt;&lt;em&gt;Question: PG&amp;rsquo;s page header has lsn, starting free space pointer, prune xid, etc. What does PolarDB-CXL&amp;rsquo;s page header structure look like?&lt;/em&gt;&lt;/p&gt;

&lt;h2 class="relative group"&gt;PolarRecv
 &lt;div id="polarrecv" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#polarrecv" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;
&lt;p&gt;PolarDB-MP was designed around RDMA: data pages are written locally, and the disaggregated shared memory doesn&amp;rsquo;t contain the latest version of data pages. After a host crash, recovery must therefore scan and apply all redo log files (the paper says redo, not WAL), or fetch pages from the small amount of shared memory.&lt;/p&gt;
&lt;p&gt;CXL switches have independent power, so even if the host crashes, the latest data remains in CXL memory. PolarRecv leverages this to dramatically speed up database recovery after host crashes.&lt;/p&gt;
&lt;p&gt;However, while CXL switch memory is transparent and persistent, directly using it after a crash still requires handling these issues:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;LRU lists may be inconsistent at crash time&lt;/li&gt;
&lt;li&gt;B-tree SMO (B-tree structure changes), such as index splits, may be inconsistent at crash time&lt;/li&gt;
&lt;li&gt;Pages being updated at crash time may be inconsistent&lt;/li&gt;
&lt;li&gt;The redo log buffer lives in local DRAM. If the redo log hasn&amp;rsquo;t been flushed to disk at crash time, a page&amp;rsquo;s LSN in the CXL buffer pool may be greater than the highest LSN in the redo log file, which violates the write-ahead logging rule that ARIES relies on.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;PolarRecv&amp;rsquo;s design strategies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use mutex to protect the LRU structure. The mutex lock state indicates whether LRU was being modified at crash time. If so, LRU must be rebuilt; if not, use the LRU directly from CXL memory.&lt;/li&gt;
&lt;li&gt;During B-tree SMO, a mini-transaction protects the index pages. The mini-transaction holds page locks in a two-phase manner, and its changes are written to the redo log only when it commits. So during recovery, if an index page is found still holding a write lock, it is recovered from the redo log.&lt;/li&gt;
&lt;li&gt;PolarCXL&amp;rsquo;s read/write locks are stored in CXL memory. If a write lock still exists, the update was in an intermediate state at crash time and never completed. In this case, rebuild the page from the redo log rather than reading an inconsistent page from CXL memory.&lt;/li&gt;
&lt;li&gt;During recovery, first obtain the maximum LSN from the redo log, then check the lock and LSN of pages in CXL memory. If a page&amp;rsquo;s LSN in CXL memory is greater than the max LSN, rebuild the page using redo log information rather than using the CXL memory version.&lt;/li&gt;
&lt;/ul&gt;
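&lt;p&gt;The per-page decision in these strategies can be sketched as follows. This is a simplified illustration with hypothetical names, assuming each page in CXL memory exposes its LSN and write-lock state. It shows only the decision of which copy to trust, not the actual redo replay.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class CxlPage:
    page_id: int
    lsn: int
    write_locked: bool


def recovery_plan(pages, max_redo_lsn, lru_mutex_held):
    """Decide, per page, whether the CXL copy can be reused after a crash."""
    reuse, rebuild = [], []
    for page in pages:
        # A surviving write lock means the update was interrupted midway.
        # An LSN beyond the durable redo log means that log never hit disk.
        if page.write_locked or page.lsn > max_redo_lsn:
            rebuild.append(page.page_id)
        else:
            reuse.append(page.page_id)
    return {
        "rebuild_lru": lru_mutex_held,  # mutex held at crash: LRU is suspect
        "reuse_from_cxl": reuse,
        "rebuild_from_redo": rebuild,
    }


pages = [
    CxlPage(1, 100, False),  # clean and covered by the redo log
    CxlPage(2, 90, True),    # write lock still set
    CxlPage(3, 130, False),  # LSN ahead of the durable redo log
]
print(recovery_plan(pages, max_redo_lsn=120, lru_mutex_held=True))
```

&lt;p&gt;The point of the design is that most pages fall into the first bucket, so only the few suspect pages need redo-based reconstruction.&lt;/p&gt;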

&lt;h2 class="relative group"&gt;Memory Fusion
 &lt;div id="memory-fusion" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#memory-fusion" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;
&lt;p&gt;PolarCXLMem is built on the CXL 2.0 switch, while hardware memory fusion only arrives with CXL 3.0, so memory fusion still has to be designed at the database layer. Since each node&amp;rsquo;s buffer pool is placed in isolation in PolarCXLMem, &lt;strong&gt;memory fusion on CXL 2.0 is achieved through DBP metadata management: each buffer pool only stores the page&amp;rsquo;s CXL memory address, not the page itself.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;


&lt;img src="https://lastdba.com/img/csdn/d3b28a223927.png" alt="image-20251123142605871" /&gt;&lt;/p&gt;
&lt;p&gt;To understand this diagram, you need to distinguish between CXL memory, DBP, and local buffer:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;CXL memory is the physical hardware, CXL mem itself.&lt;/li&gt;
&lt;li&gt;DBP is a region carved out of CXL for managing memory fusion services.&lt;/li&gt;
&lt;li&gt;Local metadata buffer contains local buffer metadata and part of CXL.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Also understand that for each page in the buffer pool, there are two flags:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;invalid: After another node writes to the page, the current node needs to invalidate its local CPU cache.&lt;/li&gt;
&lt;li&gt;removal: When a page moves from the in-use list to the free list, all nodes must set the removal flag.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Memory fusion page access flow:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The requested page is not in the local page metadata buffer:
&lt;ul&gt;
&lt;li&gt;1.1 Allocate a new meta record from the free list, and provide invalid and removal addresses to the memory fusion service via RPC.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The requested page is in the local page metadata buffer:
&lt;ul&gt;
&lt;li&gt;2.1 First check the removal flag. If removal is set, it means the memory fusion service has already reclaimed the page, and a new memory address must be requested from the memory fusion service via RPC.&lt;/li&gt;
&lt;li&gt;2.2 Then check the invalid flag. If invalid is set, it means the page has been modified by another node, and the CPU cache must be invalidated to ensure consistency.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
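&lt;p&gt;The access flow above can be sketched in Python as follows. Everything here is a hypothetical model: &lt;code&gt;FusionService&lt;/code&gt; stands in for the memory fusion service normally reached via RPC, addresses are plain integers, and cache invalidation is only recorded in a log instead of being performed.&lt;/p&gt;

```python
class FusionService:
    """Hypothetical stand-in for the memory fusion service (RPC in reality)."""

    def __init__(self):
        self.addresses = {}
        self.next_addr = 0x1000

    def lookup(self, page_id):
        """Return the CXL address of a page, assigning one if needed."""
        if page_id not in self.addresses:
            self.addresses[page_id] = self.next_addr
            self.next_addr += 0x4000
        return self.addresses[page_id]


class Node:
    def __init__(self, service):
        self.service = service
        self.meta = {}  # page_id -> {"addr", "invalid", "removal"}
        self.log = []   # records which path each access took

    def access(self, page_id):
        entry = self.meta.get(page_id)
        if entry is None:
            # Step 1.1: allocate a meta record and register it with the service.
            addr = self.service.lookup(page_id)
            self.meta[page_id] = {"addr": addr, "invalid": False, "removal": False}
            self.log.append("register")
            return addr
        if entry["removal"]:
            # Step 2.1: the page was reclaimed, so ask for a fresh address.
            entry["addr"] = self.service.lookup(page_id)
            entry["removal"] = False
            self.log.append("refetch")
        if entry["invalid"]:
            # Step 2.2: another node wrote the page, so drop cached lines.
            entry["invalid"] = False
            self.log.append("invalidate_cpu_cache")
        return entry["addr"]


service = FusionService()
node = Node(service)
node.access(7)
node.meta[7]["invalid"] = True  # simulate a write by another node
node.access(7)
print(node.log)
```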
&lt;p&gt;Fusion consistency:&lt;/p&gt;
&lt;p&gt;Since CXL 2.0 doesn&amp;rsquo;t have memory fusion, CPU caches aren&amp;rsquo;t automatically updated. PolarCXL implements multi-node concurrent write control through page-level locks.&lt;/p&gt;
&lt;p&gt;Nodes must acquire read/write locks to read/write pages. &lt;strong&gt;When one node is writing to a page, other nodes cannot read or write that page.&lt;/strong&gt; After a node finishes writing, it must also:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Flush the CPU cache to CXL mem (cache line flush) to ensure CXL mem has the latest page version.&lt;/li&gt;
&lt;li&gt;Set the invalid flag to ensure other nodes don&amp;rsquo;t read stale page versions from their CPU caches.&lt;/li&gt;
&lt;/ul&gt;
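&lt;p&gt;A minimal sketch of the write path just described, assuming a single exclusive lock per page in place of a real read/write lock. &lt;code&gt;SharedPage&lt;/code&gt; and &lt;code&gt;write_page&lt;/code&gt; are hypothetical names, and the cache line flush is modeled as a plain assignment to the CXL copy.&lt;/p&gt;

```python
import threading


class SharedPage:
    """One page in CXL memory plus a per-node invalid flag (illustrative)."""

    def __init__(self, node_ids):
        # Simplification: one exclusive lock instead of a read/write lock.
        self.lock = threading.Lock()
        self.cxl_data = b""
        self.invalid = {node: False for node in node_ids}


def write_page(page, writer, new_data, log):
    """Write under the page lock, then flush and flag the other nodes."""
    with page.lock:
        cpu_cache_copy = new_data          # the write lands in the writer's CPU cache
        log.append("write")
        page.cxl_data = cpu_cache_copy     # stands in for the cache line flush to CXL memory
        log.append("flush")
        for node in page.invalid:
            if node != writer:
                page.invalid[node] = True  # other nodes must drop their cached copy
        log.append("set_invalid")
```

&lt;p&gt;The order matters: the flush has to complete before the invalid flags are set, otherwise another node could invalidate its cache and then re-read a stale page from CXL memory.&lt;/p&gt;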
&lt;p&gt;Memory fusion summary:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;CXL 2.0 itself supports incomplete memory fusion, meaning the database layer still needs to design a memory fusion scheme. Memory pages are accessed via CXL addresses, rather than local/remote access to entire pages as in the RDMA approach. The local CPU cache needs the database layer to flush it to ensure node data access consistency — this is a hard limitation. This also means cross-node updates still use exclusive page-level locks (the RDMA approach also uses exclusive page-level locks).&lt;/strong&gt;&lt;/p&gt;

&lt;h2 class="relative group"&gt;Performance Evaluation
 &lt;div id="performance-evaluation" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#performance-evaluation" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;

&lt;h3 class="relative group"&gt;Multi-Node Read/Write
 &lt;div id="multi-node-readwrite" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#multi-node-readwrite" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h3&gt;
&lt;p&gt;Benchmarking with 12 instances on a 192 vCPU host, comparing RDMA (PolarDB-MP) vs CXL (PolarDB-MP with PolarCXLMem) performance:&lt;/p&gt;
&lt;p&gt;Point queries:&lt;/p&gt;
&lt;p&gt;


&lt;img src="https://lastdba.com/img/csdn/c5ac5a1f0d82.png" alt="image-20251124083738393" /&gt;&lt;/p&gt;
&lt;p&gt;Range queries:&lt;/p&gt;
&lt;p&gt;


&lt;img src="https://lastdba.com/img/csdn/169bdabdbf3c.png" alt="image-20251125082404440" /&gt;&lt;/p&gt;
&lt;p&gt;Read-write:&lt;/p&gt;
&lt;p&gt;


&lt;img src="https://lastdba.com/img/csdn/532e83b71906.png" alt="image-20251125082418710" /&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Point queries: Read amplification is most severe for point queries. CXL&amp;rsquo;s bandwidth consumption is 3-4x lower than RDMA. When reaching 3 nodes, RDMA bandwidth is already saturated — adding more nodes doesn&amp;rsquo;t improve bandwidth.&lt;/li&gt;
&lt;li&gt;Range queries: Read amplification is less severe. RDMA only reaches its bandwidth ceiling of 11GB/s at &amp;gt;4 nodes, while CXL can still scale linearly with nodes.&lt;/li&gt;
&lt;li&gt;Read-write: Performance is similar to range queries, just with smaller differences.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 class="relative group"&gt;PolarRecv Recovery Time
 &lt;div id="polarrecv-recovery-time" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#polarrecv-recovery-time" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;vanilla: Refers to the general approach, probably similar to PG reading from local cache or disk (possibly polar redo).&lt;/li&gt;
&lt;li&gt;RDMA-based: Refers to PolarDB-MP where some data can be read from disaggregated shared storage.&lt;/li&gt;
&lt;li&gt;PolarRecv: Refers to continuing to read most data from CXL, with only a small amount of partial pages needing recovery from redo files.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;


&lt;img src="https://lastdba.com/img/csdn/6d0ae251efed.png" alt="image-20251125085711364" /&gt;&lt;/p&gt;
&lt;p&gt;The paper discusses recovery time in 2 phases: startup/recovery and reaching pre-crash load levels. Read-only workloads need no recovery: as long as the data is there, the node can start and take load. Once writes are involved, recovery is required, and the advantage of continuing to read from CXL memory becomes apparent. The difference between 1-minute, 2-minute, and 4-minute recovery times is significant: it can separate an outage the business barely notices from one with visible impact.&lt;/p&gt;

&lt;h3 class="relative group"&gt;Shared Data Updates
 &lt;div id="shared-data-updates" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#shared-data-updates" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h3&gt;
&lt;p&gt;Updates to shared data are where distributed databases fight hardest over performance. After PolarDB-MP crushed Taurus-MM, PolarDB-CXL in turn crushed PolarDB-MP:&lt;/p&gt;
&lt;p&gt;


&lt;img src="https://lastdba.com/img/csdn/30660f6b6fd5.png" alt="image-20251130164309249" /&gt;&lt;/p&gt;
&lt;p&gt;At 0% shared data, the RDMA-based solution just accesses local buffers, and PolarDB-CXL just treats CXL as a memory pool. Even so, CXL-based still performs better, mainly due to the read/write amplification and bandwidth ceiling issues of the RDMA-based solution mentioned earlier.&lt;/p&gt;
&lt;p&gt;The performance comparison chart above shows PolarDB-CXL clearly outperforming PolarDB-MP. However, note that when shared data &amp;gt;60%, PolarDB-CXL&amp;rsquo;s performance improvement becomes less significant, mainly because:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Page-level locks become the bottleneck.&lt;/li&gt;
&lt;li&gt;As lock contention intensifies, processes enter sleep states, and frequent context switching further exacerbates resource contention.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 class="relative group"&gt;Summary
 &lt;div id="summary" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#summary" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;
&lt;p&gt;PolarDB-CXL advantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Eliminates RDMA&amp;rsquo;s &amp;ldquo;local-remote&amp;rdquo; hierarchical memory structure design.&lt;/li&gt;
&lt;li&gt;Resolves RDMA&amp;rsquo;s read/write amplification problem.&lt;/li&gt;
&lt;li&gt;Provides a CXL-based memory pool.&lt;/li&gt;
&lt;li&gt;PolarRecv, based on CXL persistent memory, enables faster database crash recovery.&lt;/li&gt;
&lt;li&gt;Benchmarking shows PolarDB-MP CXL outperforms PolarDB-MP RDMA.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;PolarDB-CXL disadvantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Cross-node updates still use page-level locks, which remain the main performance bottleneck in shared data update scenarios.&lt;/li&gt;
&lt;li&gt;The CXL 2.0 switch seems a bit dated. By the time the paper was published, switch devices supporting 3.2 were already available, and CXL 4.0 was announced in November 2025. We can expect future databases to be built on switch devices that implement the newer CXL standards.&lt;/li&gt;
&lt;li&gt;The paper quality isn&amp;rsquo;t actually as high as the MP paper — it mainly revolves around solutions for the CXL 2.0 switch physical hardware, which differs from the extensive database-layer design found in the PolarDB-MP paper.&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;&lt;p&gt;Original link: &lt;a href="https://lastdba.com/2025/11/30/" target="_blank" rel="noreferrer"&gt;https://lastdba.com/2025/11/30/&lt;/a&gt;论文精读polar-db-cxl2025-sigmod最佳工业论文/&lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item><item><title>Paper Deep Read: PolarDB-MP | 2024 SIGMOD Best Industrial Paper</title><link>https://lastdba.com/en/2025/11/30/paper-deep-read-polardb-mp-2024-sigmod-best-industrial-paper/</link><pubDate>Sun, 30 Nov 2025 00:00:00 +0000</pubDate><guid>https://lastdba.com/en/2025/11/30/paper-deep-read-polardb-mp-2024-sigmod-best-industrial-paper/</guid><description>&lt;p&gt;Paper: PolarDB-MP: A Multi-Primary Cloud-Native Database via Disaggregated Shared Memory&lt;/p&gt;
&lt;p&gt;SIGMOD best paper: &lt;a href="https://sigmod.org/sigmod-awards/sigmod-best-paper-award/" target="_blank" rel="noreferrer"&gt;https://sigmod.org/sigmod-awards/sigmod-best-paper-award/&lt;/a&gt;&lt;/p&gt;

&lt;h2 class="relative group"&gt;Foreword and Abstract
 &lt;div id="foreword-and-abstract" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#foreword-and-abstract" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;
&lt;p&gt;The paper opens with the problem: primary-replica architecture&amp;rsquo;s write throughput is limited by the primary. Shared-nothing architecture offers scalable multi-primary clusters that can solve the single-primary limitation, but this architecture suffers performance bottlenecks due to distributed transaction overhead. Recently, shared-storage-based cloud-native multi-primary databases have emerged, but under high-conflict scenarios, they face high conflict resolution costs and low data fusion efficiency.&lt;/p&gt;</description><content:encoded>&lt;p&gt;Paper: PolarDB-MP: A Multi-Primary Cloud-Native Database via Disaggregated Shared Memory&lt;/p&gt;
&lt;p&gt;SIGMOD best paper: &lt;a href="https://sigmod.org/sigmod-awards/sigmod-best-paper-award/" target="_blank" rel="noreferrer"&gt;https://sigmod.org/sigmod-awards/sigmod-best-paper-award/&lt;/a&gt;&lt;/p&gt;

&lt;h2 class="relative group"&gt;Foreword and Abstract
 &lt;div id="foreword-and-abstract" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#foreword-and-abstract" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;
&lt;p&gt;The paper opens with the problem: primary-replica architecture&amp;rsquo;s write throughput is limited by the primary. Shared-nothing architecture offers scalable multi-primary clusters that can solve the single-primary limitation, but this architecture suffers performance bottlenecks due to distributed transaction overhead. Recently, shared-storage-based cloud-native multi-primary databases have emerged, but under high-conflict scenarios, they face high conflict resolution costs and low data fusion efficiency.&lt;/p&gt;
&lt;p&gt;So the problem is: single-primary primary-replica, shared-nothing, and shared-storage cloud-native multi-primary architectures all have their own issues.&lt;/p&gt;
&lt;p&gt;This paper proposes PolarDB-MP, a novel multi-primary cloud-native database combining disaggregated shared memory with shared storage. (Since multi-primary cloud-native databases already exist, it needs to be &amp;ldquo;novel.&amp;rdquo;)&lt;/p&gt;
&lt;p&gt;PolarDB-MP&amp;rsquo;s basic characteristics:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;All nodes can equally access all data, allowing transactions to be processed independently on a single node, &lt;strong&gt;without traditional distributed transaction mechanisms&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Shared storage: PolarStore and PolarFS, or other compatible shared storage solutions.&lt;/li&gt;
&lt;li&gt;Built on disaggregated shared memory.&lt;/li&gt;
&lt;li&gt;Low-latency communication via &lt;strong&gt;RDMA&lt;/strong&gt; (Remote Direct Memory Access).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LLSN&lt;/strong&gt; (Logical Log Sequence Number): Used to establish a partial order over the WAL logs generated by different nodes, combined with a custom recovery strategy to ensure consistency and efficiency during crash recovery.&lt;/li&gt;
&lt;li&gt;Core component &lt;strong&gt;PMFS&lt;/strong&gt; (Polar Multi-Primary Fusion Server) responsible for:
&lt;ul&gt;
&lt;li&gt;Transaction Fusion — transaction ordering and visibility management&lt;/li&gt;
&lt;li&gt;Buffer Fusion — distributed shared buffer mechanism&lt;/li&gt;
&lt;li&gt;Lock Fusion — cross-node concurrency control&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 class="relative group"&gt;Classification
 &lt;div id="classification" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#classification" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h3&gt;
&lt;p&gt;The classification is mainly to understand PolarDB-MP&amp;rsquo;s historical position and the &amp;ldquo;first&amp;rdquo; qualifier:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;PolarDB-MP is the first multi-primary cloud-native database that utilizes disaggregated shared memory and shared storage for transaction coordination and buffer fusion&lt;/p&gt;
&lt;/blockquote&gt;&lt;p&gt;


&lt;img src="https://lastdba.com/img/csdn/3532d59aa524.png" alt="image-20251109213814089" /&gt;&lt;/p&gt;

&lt;h3 class="relative group"&gt;Competitor Weaknesses
 &lt;div id="competitor-weaknesses" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#competitor-weaknesses" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h3&gt;
&lt;p&gt;Shared-nothing products: The paper doesn&amp;rsquo;t call out individual products, just one line: transactions accessing across multiple partitions require significant additional overhead for distributed transactions.&lt;/p&gt;
&lt;p&gt;Oracle:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Expensive distributed lock management&lt;/li&gt;
&lt;li&gt;Expensive network overhead&lt;/li&gt;
&lt;li&gt;Reliance on sophisticated hardware (alien tech)&lt;/li&gt;
&lt;li&gt;Difficult to migrate to cloud, or higher TCO (including maintenance and labor costs) compared to cloud-native databases after migration&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;AWS Aurora-MM:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Uses optimistic transaction model; high transaction abort rates under conflicts&lt;/li&gt;
&lt;li&gt;In some scenarios, 4-node throughput is lower than single-node&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Huawei Taurus-MM:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Pessimistic transaction model. Relies on page storage and log replay to ensure cache consistency, with high overhead in concurrency control and data synchronization.&lt;/li&gt;
&lt;li&gt;Under 50% shared data read-write workload, 8 nodes only achieve 1.5x single-node performance improvement&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The Oracle critique here is mainly plausible-sounding trash talk, while Aurora-MM and Taurus-MM have original vendor citations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Aurora-MM &amp;ldquo;in some scenarios, 4-node throughput is lower than single-node&amp;rdquo;&lt;/li&gt;
&lt;li&gt;Taurus-MM &amp;ldquo;under 50% shared data read-write workload, 8 nodes only achieve 1.5x single-node performance improvement&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 class="relative group"&gt;Transaction Fusion
 &lt;div id="transaction-fusion" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#transaction-fusion" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;

&lt;h3 class="relative group"&gt;Transaction Fusion Overview
 &lt;div id="transaction-fusion-overview" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#transaction-fusion-overview" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h3&gt;
&lt;p&gt;How does multi-primary ensure consistent data views?&lt;/p&gt;
&lt;p&gt;Snapshot isolation is commonly implemented on top of MVCC. A characteristic of snapshot isolation is that a query or transaction must keep a consistent view of the data for the whole of its execution. But in a multi-primary architecture, a node cannot guarantee a consistent view on its own, because data is also being updated on remote nodes.&lt;/p&gt;
&lt;p&gt;To solve this, general multi-primary shared-storage architectures introduce global transaction mechanisms (Aurora-MM or Taurus-MM). PolarDB-MP introduces an innovative technique — transaction fusion within PMFS. &lt;strong&gt;Each node only maintains local transaction information, which can be accessed by other nodes via RDMA.&lt;/strong&gt; In contrast to global transactions, transaction fusion is decentralized.&lt;/p&gt;

&lt;h3 class="relative group"&gt;Local Transactions and TIT Table
 &lt;div id="local-transactions-and-tit-table" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#local-transactions-and-tit-table" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h3&gt;
&lt;p&gt;Each node in PolarDB-MP maintains a small amount of memory to store local transaction information (accessible by other nodes via RDMA). This local transaction information is stored in the Transaction Information Table (TIT).&lt;/p&gt;
&lt;p&gt;TIT table contents:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Transaction object pointer&lt;/li&gt;
&lt;li&gt;Commit timestamp (CTS) assigned by the global Timestamp Oracle (TSO)&lt;/li&gt;
&lt;li&gt;version, representing different transactions in the same slot&lt;/li&gt;
&lt;li&gt;ref, indicating whether this transaction is being waited on by other transactions for lock release (probably PLock or RLock)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;


&lt;img src="https://lastdba.com/img/csdn/90641c3618d1.png" alt="image-20251101131556184" /&gt;&lt;/p&gt;

&lt;h3 class="relative group"&gt;How Transactions Proceed
 &lt;div id="how-transactions-proceed" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#how-transactions-proceed" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h3&gt;
&lt;p&gt;When a transaction begins, a local transaction id (presumably txid) is assigned, and the TIT slot stores the transaction object pointer, ref initialized to 0, and CTS initialized to &lt;code&gt;CSN_INIT&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;PolarDB-MP uses a global transaction ID to identify a transaction: global transaction ID = (node_id, trx_id, slot_id, version). The global transaction ID does not include the CTS. To learn a transaction&amp;rsquo;s commit order, for example when constructing a visibility view, a node uses the global transaction ID to locate the owning node and TIT slot, then reads the CTS from that node via RDMA (similar to PG&amp;rsquo;s &lt;code&gt;pg_xact_commit_timestamp()&lt;/code&gt; function, which looks up a transaction&amp;rsquo;s commit time in local files by transaction id).&lt;/p&gt;
&lt;p&gt;If trx_id is the transaction ID in PG, then node_id + trx_id already identifies a transaction globally, and node_id + slot_id + version can also do so to some extent (as long as the slot has not been reused, i.e. at a given moment it identifies exactly one transaction). The full four-field combination is of course unique as well, and all of these fields are key to how PolarDB-MP implements transaction fusion.&lt;/p&gt;
&lt;p&gt;Each transaction constructs a visibility view using the global transaction ID and CTS. The visibility view concept is consistent with PG: the current read view can read data rows committed before the read view, and the latest version rows.&lt;/p&gt;
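&lt;p&gt;To make the TIT and the global transaction ID concrete, here is a rough Python sketch. The field names follow the post; everything else is an assumption for illustration: the numeric value of &lt;code&gt;CSN_INIT&lt;/code&gt;, the &lt;code&gt;Tit&lt;/code&gt; class and its methods, and the choice to bump &lt;code&gt;version&lt;/code&gt; each time a slot is reused.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import NamedTuple

CSN_INIT = 0  # placeholder value; the post does not give the real constant


class GlobalTrxId(NamedTuple):
    node_id: int
    trx_id: int
    slot_id: int
    version: int


@dataclass
class TitSlot:
    trx: object = None   # pointer to the transaction object
    cts: int = CSN_INIT  # commit timestamp, CSN_INIT while still active
    version: int = 0     # distinguishes successive transactions in one slot
    ref: int = 0         # whether other transactions wait on this one


class Tit:
    """Per-node Transaction Information Table (hypothetical helper class)."""

    def __init__(self, node_id, num_slots):
        self.node_id = node_id
        self.slots = [TitSlot() for _ in range(num_slots)]
        self.next_trx_id = 1

    def begin(self, slot_id, trx=None):
        slot = self.slots[slot_id]
        slot.version += 1  # assumption: reuse of a slot bumps its version
        slot.trx, slot.cts, slot.ref = trx, CSN_INIT, 0
        trx_id = self.next_trx_id
        self.next_trx_id += 1
        return GlobalTrxId(self.node_id, trx_id, slot_id, slot.version)

    def commit(self, gid, cts):
        # The CTS would come from the global timestamp service.
        self.slots[gid.slot_id].cts = cts
```

&lt;p&gt;With this layout, a reader holding a &lt;code&gt;GlobalTrxId&lt;/code&gt; knows which node and which slot to read, and can tell from &lt;code&gt;version&lt;/code&gt; whether the slot still belongs to the same transaction.&lt;/p&gt;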

&lt;h3 class="relative group"&gt;Accessing Remote CTS
 &lt;div id="accessing-remote-cts" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#accessing-remote-cts" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h3&gt;
&lt;p&gt;Since CTS is local (in TIT or on the local filesystem), obtaining the CTS of the transaction that wrote a row takes several steps:&lt;/p&gt;
&lt;p&gt;


&lt;img src="https://lastdba.com/img/csdn/41db536b2114.png" alt="image-20251101153437311" /&gt;&lt;/p&gt;
&lt;p&gt;1.1 If a row&amp;rsquo;s CTS is CSN_INIT/CTS_INIT, meaning the transaction is still active, return the maximum CTS to indicate it&amp;rsquo;s invisible to all transactions except itself.&lt;/p&gt;
&lt;p&gt;1.2 If a row&amp;rsquo;s CTS is not CSN_INIT/CTS_INIT, meaning the transaction has committed, and it&amp;rsquo;s in the local TIT, directly return CTS.&lt;/p&gt;
&lt;p&gt;2. If a row has no CTS, obtain CTS via the row&amp;rsquo;s g_trx_id.&lt;/p&gt;
&lt;p&gt;2.1 If the transaction belongs to the local node (g_trx_id has node id), read from local filesystem to local TIT.&lt;/p&gt;
&lt;p&gt;2.2 If the transaction doesn&amp;rsquo;t belong to the local node, read from remote filesystem to remote TIT via RDMA.&lt;/p&gt;
&lt;p&gt;3.1 If slot.version != g_trx_id.version, the transaction must have committed, so the row is definitely visible to all transactions. Return minimum CTS to indicate visibility to all transactions.&lt;/p&gt;
&lt;p&gt;3.2 If slot.version = g_trx_id.version, refer to 1.1, 1.2.&lt;/p&gt;
&lt;p&gt;PolarDB-MP&amp;rsquo;s transaction visibility concept is very similar to PG&amp;rsquo;s, except PG uses txid instead of CTS to indicate transaction ordering and doesn&amp;rsquo;t need to consider remote access.&lt;/p&gt;
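&lt;p&gt;The lookup steps above can be written down as a short function. This is a sketch of the flow as described in this post, not the paper&amp;rsquo;s code: the constant values, the dictionary layout of slots and IDs, and the &lt;code&gt;rdma_read_slot&lt;/code&gt; callback (standing in for a one-sided RDMA read of a remote TIT slot) are all assumptions.&lt;/p&gt;

```python
CSN_INIT = 0           # placeholder: transaction still active
CSN_MIN = 1            # placeholder: visible to every transaction
CSN_MAX = 2**63 - 1    # placeholder: invisible to every other transaction


def resolve(cts):
    """Steps 1.1 and 1.2: an active transaction maps to the maximum CTS."""
    return CSN_MAX if cts == CSN_INIT else cts


def get_cts(row_cts, gid, local_node_id, local_slots, rdma_read_slot):
    """Return the CTS to use for a visibility check on one row."""
    if row_cts is not None:
        return resolve(row_cts)
    # Step 2: the row carries no CTS, so go through its g_trx_id.
    if gid["node_id"] == local_node_id:
        slot = local_slots[gid["slot_id"]]                      # 2.1 local TIT
    else:
        slot = rdma_read_slot(gid["node_id"], gid["slot_id"])   # 2.2 remote TIT
    if slot["version"] != gid["version"]:
        # 3.1: the slot was reused, so the old transaction has finished
        return CSN_MIN
    # 3.2: same transaction still owns the slot, so apply 1.1 / 1.2
    return resolve(slot["cts"])
```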

&lt;h3 class="relative group"&gt;Row Update Transactions
 &lt;div id="row-update-transactions" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#row-update-transactions" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h3&gt;
&lt;p&gt;Additionally, row updates are also very similar:&lt;/p&gt;
&lt;p&gt;When PolarDB-MP updates a row, besides updating the data itself, it must also:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Update the row&amp;rsquo;s global transaction ID (g_trx_id) (if it&amp;rsquo;s an in-row update, then it modifies PG&amp;rsquo;s row header).&lt;/li&gt;
&lt;li&gt;Update the row&amp;rsquo;s CTS. (The paper doesn&amp;rsquo;t specify whether this lives in the row header or on the filesystem. If it works like PG, it would be in the &lt;code&gt;pg_commit_ts&lt;/code&gt; directory on the filesystem. I haven&amp;rsquo;t confirmed this for PolarDB.)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 class="relative group"&gt;Questions About Transaction Fusion (Things I Didn&amp;rsquo;t Understand)
 &lt;div id="questions-about-transaction-fusion-things-i-didnt-understand" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#questions-about-transaction-fusion-things-i-didnt-understand" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h3&gt;
&lt;p&gt;g_trx_id is row metadata written to disk. If nodes are added or removed, does the node_id in the data row&amp;rsquo;s g_trx_id need updating? If not, which node should the row be loaded into when read next time?&lt;/p&gt;
&lt;p&gt;A new row&amp;rsquo;s CTS is stored on local node A. If another node B updates this row, is the new CTS on node A or B?&lt;/p&gt;
&lt;p&gt;&amp;ldquo;assigned a read view, which consists of its own g_trx_id and the current CTS.&amp;rdquo; Do read-only transactions also get assigned a g_trx_id when constructing a read view?&lt;/p&gt;
&lt;p&gt;Without a doubt, a parameter like &lt;code&gt;track_commit_timestamp&lt;/code&gt; must be forcibly enabled.&lt;/p&gt;
&lt;p&gt;If there are many writes on node A and reads on node B, B&amp;rsquo;s reads will access A&amp;rsquo;s TIT data via RDMA — does this generate significant network IO? Should this be considered when designing read-write separation or multi-node reads and writes? The original paper might answer this — &amp;ldquo;Multi-primary architectures inherently require synchronizing large amounts of data and messages between nodes to support concurrent access across multiple nodes. As network technology develops (InfiniBand, RDMA) and achieves commercial deployment, the network bottleneck becomes less significant.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Global timestamps could become a bottleneck in distributed systems. PolarDB-SCC is a shared-storage-based timestamp solution that appears to perform well. Due to time constraints, I&amp;rsquo;ll set this aside for now.&lt;/p&gt;

&lt;h2 class="relative group"&gt;Buffer Fusion
 &lt;div id="buffer-fusion" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#buffer-fusion" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;

&lt;h3 class="relative group"&gt;Buffer Fusion Introduction
 &lt;div id="buffer-fusion-introduction" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#buffer-fusion-introduction" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h3&gt;
&lt;p&gt;Each node in PolarDB-MP can update any data page, leading to substantial data transfer. Buffer Fusion&amp;rsquo;s distributed buffer pool (DBP) is designed to solve this problem. Each node has a local buffer pool (LBP), which is a subset of DBP.&lt;/p&gt;

&lt;h3 class="relative group"&gt;How Buffer Fusion Works
 &lt;div id="how-buffer-fusion-works" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#how-buffer-fusion-works" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h3&gt;
&lt;p&gt;LBP has two new metadata items for pages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;valid: whether the local copy is still current; it becomes invalid once another node updates the page&lt;/li&gt;
&lt;li&gt;r_addr: pointer to the page in DBP&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;


&lt;img src="https://lastdba.com/img/csdn/652bf5e74943.png" alt="image-20251102105723909" /&gt;&lt;/p&gt;
&lt;p&gt;When accessing a page from LBP, the current node must first check whether the page is valid. If it is invalid, the node must fetch the page from DBP via r_addr. After DBP stores a new version of a page, buffer fusion invalidates the copies held in other nodes&amp;rsquo; LBPs. Dirty pages in LBP are flushed to DBP periodically in the background, or when the PLock is released.&lt;/p&gt;
&lt;p&gt;Page access steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The page is in LBP:
&lt;ul&gt;
&lt;li&gt;1.1 If the page is valid, access directly.&lt;/li&gt;
&lt;li&gt;1.2 If the page is invalid, access DBP via RDMA.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;If the page is in neither LBP nor DBP, read from shared storage.&lt;/li&gt;
&lt;li&gt;The page is loaded from a node into LBP and registered in DBP.&lt;/li&gt;
&lt;/ol&gt;
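&lt;p&gt;Here is a hedged sketch of these page access steps, with plain dictionaries standing in for LBP, DBP, and shared storage. &lt;code&gt;read_page&lt;/code&gt; is a hypothetical function, &lt;code&gt;r_addr&lt;/code&gt; is simplified to the page id, and the branch for a page that is in DBP but not in LBP is my own assumption, since the steps do not spell that case out.&lt;/p&gt;

```python
def read_page(page_id, lbp, dbp, storage, trace):
    """Return page data, recording which tier served the request."""
    entry = lbp.get(page_id)
    if entry is not None:
        if entry["valid"]:
            trace.append("lbp_hit")               # 1.1 local copy is current
        else:
            entry["data"] = dbp[entry["r_addr"]]  # 1.2 refetch from DBP (RDMA in reality)
            entry["valid"] = True
            trace.append("dbp_refresh")
        return entry["data"]
    if page_id in dbp:
        # Assumption: a page cached in DBP but not locally is read from DBP.
        data = dbp[page_id]
        trace.append("dbp_read")
    else:
        data = storage[page_id]                   # 2. fall back to shared storage
        dbp[page_id] = data                       # 3. register the page in DBP
        trace.append("storage_read")
    # 3. keep a local copy together with its DBP address
    lbp[page_id] = {"data": data, "valid": True, "r_addr": page_id}
    return data
```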
&lt;p&gt;The key component of PolarDB-MP&amp;rsquo;s buffer fusion is disaggregated shared memory. It appears to be a piece (or group) of physical hardware, or an integrated component built on top of such hardware, separate from the compute nodes. This differs significantly from memory in traditional distributed systems.&lt;/p&gt;
&lt;p&gt;It also differs from transaction fusion: transaction fusion accesses peer nodes of the same architecture, while buffer fusion goes to the separate disaggregated shared memory component instead.&lt;/p&gt;

&lt;h3 class="relative group"&gt;Questions About Buffer Fusion (Things I Didn&amp;rsquo;t Understand)
 &lt;div id="questions-about-buffer-fusion-things-i-didnt-understand" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#questions-about-buffer-fusion-things-i-didnt-understand" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h3&gt;
&lt;p&gt;Disaggregated shared memory seems like a component separate from standard hosts — so what exactly is it?&lt;/p&gt;

&lt;h2 class="relative group"&gt;Lock Fusion
 &lt;div id="lock-fusion" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#lock-fusion" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;

&lt;h3 class="relative group"&gt;Lock Types in Lock Fusion
 &lt;div id="lock-types-in-lock-fusion" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#lock-types-in-lock-fusion" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h3&gt;
&lt;p&gt;Buffer fusion solves how nodes access remote data; lock fusion solves concurrent access control.&lt;/p&gt;
&lt;p&gt;Lock fusion has two types of locks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;page-locking (PLock): Similar to latches, controlling atomic access and internal structure consistency. Single-node page access doesn&amp;rsquo;t use PLock.&lt;/li&gt;
&lt;li&gt;row-locking (RLock): Responsible for cross-node transaction control, following the two-phase locking (2PL) protocol.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 class="relative group"&gt;PLock Access Flow
 &lt;div id="plock-access-flow" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#plock-access-flow" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h3&gt;
&lt;p&gt;(The paper doesn&amp;rsquo;t say where lock fusion occurs. Since PLock is a page-level latch and page fusion happens on shared memory, I&amp;rsquo;ll assume lock fusion also occurs on shared memory, as this is easier to understand.)&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Before updating/reading a page, the &lt;em&gt;local lock manager&lt;/em&gt; checks whether the local node already holds the corresponding X/S PLock (or higher-level lock).
&lt;ul&gt;
&lt;li&gt;1.1 If yes, execute in place.&lt;/li&gt;
&lt;li&gt;1.2 If no, acquire PLock through Lock Fusion.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Lock fusion checks for conflicts before responding; if a conflict exists, the request waits.&lt;/li&gt;
&lt;li&gt;When PLock is released by a node, it notifies Lock Fusion, which updates PLock&amp;rsquo;s state and notifies other nodes to continue their operations.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;


&lt;img src="https://lastdba.com/img/csdn/92aa2082d68f.png" alt="image-20251102142359091" /&gt;&lt;/p&gt;

&lt;h3 class="relative group"&gt;PLock Lazy Releasing
 &lt;div id="plock-lazy-releasing" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#plock-lazy-releasing" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h3&gt;
&lt;p&gt;According to the PLock access flow above, a PLock is immediately released after local operations complete. This may not be optimal. By temporal locality, &amp;ldquo;a data item or instruction accessed at a given time is likely to be accessed again in the near future,&amp;rdquo; so the same node will probably need the same PLock again soon. Lazy releasing reduces the number of PLock RPCs.&lt;/p&gt;
&lt;p&gt;The principle is simple: a PLock is not released immediately after use on the local node; it is released only once its ref count has reached 0.&lt;/p&gt;
&lt;p&gt;When another node needs the PLock while the local node still holds it, Lock Fusion sends a negotiation message to the holder; the local node must coordinate with Lock Fusion rather than manage the PLock on its own. Lock Fusion resolves cross-node lock ownership in &amp;ldquo;first-in-first-out&amp;rdquo; order, and the requesting node acquires the lock once the holder&amp;rsquo;s ref = 0.&lt;/p&gt;
&lt;p&gt;Lazy releasing is an effective distributed lock solution, balancing local lock optimization with global lock allocation.&lt;/p&gt;
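&lt;p&gt;Here is a minimal sketch of how I read lazy releasing, seen from the node that holds the lock. It assumes the node keeps the PLock after use and hands it back only when a negotiation message has arrived and ref is 0. &lt;code&gt;LazyPLock&lt;/code&gt;, its fields, and the &lt;code&gt;rpc_count&lt;/code&gt; counter are hypothetical and exist only to make the saving visible.&lt;/p&gt;

```python
class LazyPLock:
    """Node-side state for one page lock (hypothetical structure)."""

    def __init__(self):
        self.held = False               # does this node currently own the PLock?
        self.ref = 0                    # local users currently working on the page
        self.release_requested = False  # a negotiation message is pending
        self.rpc_count = 0              # round trips to Lock Fusion, for illustration

    def enter(self):
        if not self.held:
            self.rpc_count += 1         # only a cold access pays for an RPC
            self.held = True
        self.ref += 1

    def leave(self):
        self.ref -= 1
        # Lazy: keep the lock unless another node is already waiting for it.
        if self.ref == 0 and self.release_requested:
            self._give_back()

    def on_negotiation(self):
        """Lock Fusion reports that another node wants this PLock."""
        if self.ref == 0:
            self._give_back()
        else:
            self.release_requested = True

    def _give_back(self):
        self.held = False
        self.release_requested = False


lock = LazyPLock()
for _ in range(3):
    lock.enter()
    lock.leave()
print(lock.rpc_count, lock.held)   # 1 True

lock.enter()
lock.on_negotiation()
print(lock.held)                   # True, still in use locally
lock.leave()
print(lock.held)                   # False, handed back once ref is 0
```

&lt;p&gt;Three back-to-back accesses cost one remote request instead of three, which is the temporal-locality argument in code form.&lt;/p&gt;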

&lt;h3 class="relative group"&gt;RLock Overview
 &lt;div id="rlock-overview" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#rlock-overview" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h3&gt;
&lt;p&gt;RLock relies on the global transaction ID to determine lock state (similar to PG). As described in the transaction fusion section, the global transaction ID contains the node id, transaction id, slot id, and version. So when a local node reads a row, it can obtain the lock information directly from the row: which node holds the lock (node id) and whether the lock is still active.&lt;/p&gt;
&lt;p&gt;There are two interesting points about determining transaction activity:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;From the transaction fusion flow for accessing a remote CTS: if the transaction&amp;rsquo;s CTS is a valid value, or the TIT slot is the same but its version differs, the transaction has definitely committed, so there is no need to check activity. If the source transaction is not active, there is no need to wait for the lock; proceed directly.&lt;/li&gt;
&lt;li&gt;PG has the concept of a minimum active transaction ID, which also exists in PolarDB-MP. If the transaction ID on the row is less than the global minimum active transaction ID, the source transaction must have also committed (or rolled back).&lt;/li&gt;
&lt;/ul&gt;
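&lt;p&gt;The two shortcuts above can be written as one small check. This is a sketch under my own assumptions: &lt;code&gt;GTrxId&lt;/code&gt;, &lt;code&gt;TitSlot&lt;/code&gt;, and &lt;code&gt;may_hold_row_lock&lt;/code&gt; are hypothetical names, the TIT is modeled as a plain dictionary keyed by (node id, slot id), and transaction ids are assumed to be comparable against a single global minimum.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GTrxId:
    """Global transaction ID as stored on the row (field layout assumed)."""
    node_id: int
    trx_id: int
    slot_id: int
    version: int


@dataclass
class TitSlot:
    version: int
    cts: Optional[int]   # None while the transaction has no commit timestamp


def may_hold_row_lock(row_gid, tit, global_min_active_trx_id):
    """Return True only if the writer of the row may still be active."""
    # Older than every active transaction: it must have finished.
    if global_min_active_trx_id > row_gid.trx_id:
        return False
    slot = tit[(row_gid.node_id, row_gid.slot_id)]
    # Same slot, different version: the slot was reused by a later transaction.
    if slot.version != row_gid.version:
        return False
    # A valid CTS means the transaction has committed.
    if slot.cts is not None:
        return False
    return True


tit = {
    (1, 0): TitSlot(version=5, cts=None),
    (2, 3): TitSlot(version=9, cts=120),
}
print(may_hold_row_lock(GTrxId(1, 100, 0, 5), tit, 90))   # True
print(may_hold_row_lock(GTrxId(1, 80, 0, 4), tit, 90))    # False
print(may_hold_row_lock(GTrxId(2, 110, 3, 9), tit, 90))   # False
```

&lt;p&gt;Only the first case would need to go on and wait for the row lock; the other two can proceed directly.&lt;/p&gt;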

&lt;h3 class="relative group"&gt;How RLock Works
 &lt;div id="how-rlock-works" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#how-rlock-works" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h3&gt;
&lt;p&gt;Local rows are handled locally; only conflicts are processed in Lock Fusion; cross-node row locks require RLock. &amp;ldquo;The transaction ID in the row functions as a lock indicator. So this protocol only supports exclusive (X) lock. The shared (S) lock on a row is not supported in PolarDB-MP, but it&amp;rsquo;s acceptable.&amp;rdquo; Only truly conflicting exclusive locks need RLock; shared locks don&amp;rsquo;t need RLock.&lt;/p&gt;
&lt;p&gt;


&lt;img src="https://lastdba.com/img/csdn/a84d705b09b8.png" alt="image-20251102155613001" /&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;T30 reads the row from shared storage and can determine from the row&amp;rsquo;s metadata (g_trx_id) that the transaction is active and which node it&amp;rsquo;s on.&lt;/li&gt;
&lt;li&gt;T30 remotely updates the ref field in T10&amp;rsquo;s transaction metadata to record that a transaction is waiting on it.&lt;/li&gt;
&lt;li&gt;T30 sends a wait status to the Lock Fusion service.&lt;/li&gt;
&lt;li&gt;Lock Fusion adds wait information to the wait info table.&lt;/li&gt;
&lt;li&gt;T10 finishes execution and notifies Lock Fusion.&lt;/li&gt;
&lt;li&gt;Lock Fusion checks the wait info table, then notifies T30 it can continue.&lt;/li&gt;
&lt;/ol&gt;
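&lt;p&gt;The six steps can be mimicked with a small in-process model. All names here (&lt;code&gt;RowLockFusion&lt;/code&gt;, &lt;code&gt;TrxMeta&lt;/code&gt;, &lt;code&gt;try_update_row&lt;/code&gt;, &lt;code&gt;finish&lt;/code&gt;) are hypothetical, and treating ref as a counter that tells the holder whether it needs to contact Lock Fusion at all is my assumption rather than something stated in this section.&lt;/p&gt;

```python
from collections import defaultdict


class RowLockFusion:
    """Toy wait-info table kept by Lock Fusion (hypothetical structure)."""

    def __init__(self):
        self.wait_info = defaultdict(list)   # holder trx -> waiting trxs

    def register_wait(self, waiter, holder):
        self.wait_info[holder].append(waiter)

    def on_finish(self, holder):
        # Hand back everyone who was waiting on this transaction.
        return self.wait_info.pop(holder, [])


class TrxMeta:
    def __init__(self):
        self.active = True
        self.ref = 0   # assumed: number of remote transactions waiting on this one


def try_update_row(row, waiter, trx_table, fusion):
    holder = row["g_trx_id"]               # read the lock owner from the row
    meta = trx_table.get(holder)
    if meta is None or not meta.active:
        row["g_trx_id"] = waiter           # no live owner: take the row lock
        return "locked"
    meta.ref += 1                          # tell the holder somebody is waiting
    fusion.register_wait(waiter, holder)   # recorded in the wait info table
    return "waiting"


def finish(trx, trx_table, fusion):
    meta = trx_table[trx]
    meta.active = False
    # Assumption: only contact Lock Fusion when somebody is actually waiting.
    return fusion.on_finish(trx) if meta.ref > 0 else []


trx_table = {"T10": TrxMeta()}
fusion = RowLockFusion()
row = {"g_trx_id": "T10"}
print(try_update_row(row, "T30", trx_table, fusion))   # waiting
print(finish("T10", trx_table, fusion))                # ['T30']
print(try_update_row(row, "T30", trx_table, fusion))   # locked
```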

&lt;h3 class="relative group"&gt;Questions About Lock Fusion (Things I Didn&amp;rsquo;t Understand)
 &lt;div id="questions-about-lock-fusion-things-i-didnt-understand" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#questions-about-lock-fusion-things-i-didnt-understand" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h3&gt;
&lt;p&gt;&amp;ldquo;when attempting to update a row, it must already hold an X PLock lock on the page containing the row&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Updating also requires holding an exclusive PLock on the page, meaning updates on the same page block each other — doesn&amp;rsquo;t this affect concurrency? Locally, there shouldn&amp;rsquo;t be such behavior; PG doesn&amp;rsquo;t have page-exclusive locks for update scenarios.&lt;/p&gt;
&lt;p&gt;In the &amp;ldquo;Logs ordering and recovery&amp;rdquo; chapter, there are two statements: &amp;ldquo;Thanks to the PLock design, only one transaction can update a page at a time&amp;rdquo; and &amp;ldquo;When a page is updated across two nodes, one node pushes its updated page to the DBP before releasing the PLock, allowing the next node to retrieve it from the DBP.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Yes, during &lt;strong&gt;cross-node&lt;/strong&gt; data updates, there are page-level exclusive locks.&lt;/p&gt;

&lt;h2 class="relative group"&gt;PMFS Summary (Hot Take)
 &lt;div id="pmfs-summary-hot-take" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#pmfs-summary-hot-take" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;
&lt;p&gt;PMFS (Polar Multi-Primary Fusion Server) is the core component implementing PolarDB-MP&amp;rsquo;s multi-primary distributed system. Among its features, the &lt;strong&gt;global transaction ID&lt;/strong&gt; design is ingenious — it transforms PG&amp;rsquo;s transaction ID into one containing node information, transaction id, and transaction fusion&amp;rsquo;s slot and version information, placed in the row header. This has several benefits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Directly accessing a row reveals the row&amp;rsquo;s version ordering.&lt;/li&gt;
&lt;li&gt;Directly accessing a row reveals which node updated it.&lt;/li&gt;
&lt;li&gt;Directly accessing a row reveals whether cross-node locks may exist.&lt;/li&gt;
&lt;li&gt;Uses the minimum active transaction ID to cut down on conflict checks.&lt;/li&gt;
&lt;li&gt;Uses global transaction ID information to achieve distributed retrieval of transaction commit timestamps (CTS).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Additionally:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Buffer fusion and lock fusion in PMFS appear highly dependent on the shared memory component.&lt;/li&gt;
&lt;li&gt;RDMA is used throughout.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 class="relative group"&gt;Log Ordering
 &lt;div id="log-ordering" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#log-ordering" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;

&lt;h3 class="relative group"&gt;Partial Order
 &lt;div id="partial-order" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#partial-order" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h3&gt;
&lt;p&gt;First, WAL is generated on each node without any concurrency control mechanism — each writes independently to shared storage. Each node&amp;rsquo;s LSN is sequential for that node, but across multiple nodes, WAL records don&amp;rsquo;t exhibit global ordering.&lt;/p&gt;
&lt;p&gt;But is global ordering needed when writing WAL records?&lt;/p&gt;
&lt;p&gt;From the paper, most of the time it&amp;rsquo;s not needed.&lt;/p&gt;
&lt;p&gt;Only one case requires guaranteed global ordering during writing: cross-node updates to the same page.&lt;/p&gt;
&lt;p&gt;However, according to the PMFS lock fusion mechanism, cross-node updates to the same page are exclusive. Lock fusion can ensure the ordering of cross-node page updates.&lt;/p&gt;

&lt;h3 class="relative group"&gt;Recovery Ordering
 &lt;div id="recovery-ordering" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#recovery-ordering" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h3&gt;
&lt;p&gt;Because WAL records come from multiple nodes, their LLSNs are likely interleaved out of order, yet recovery must replay them in order. Reading all WAL records and sorting them by LLSN is the simple approach, but sorting at that scale is very resource-intensive.&lt;/p&gt;
&lt;p&gt;PolarDB-MP instead sorts LLSNs segment by segment. Each segment is called a chunk, and chunk boundaries are called LLSN bounds. PolarDB-MP guarantees that each LLSN bound is less than the next one, so it only needs to sort LLSNs within each chunk.&lt;/p&gt;
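&lt;p&gt;To make the chunk idea concrete, here is a rough sketch. It assumes that every record in a chunk has an LLSN above the previous bound and no greater than its own bound, so sorted chunks can simply be concatenated. The function &lt;code&gt;replay_order&lt;/code&gt; and the record layout are hypothetical; the paper does not describe the mechanism at this level of detail.&lt;/p&gt;

```python
import bisect


def replay_order(records, bounds):
    """Order WAL records chunk by chunk.

    records: (llsn, node_id, payload) tuples in arrival order.
    bounds:  ascending LLSN bounds; chunk i holds records whose LLSN is
             above bounds[i - 1] and not above bounds[i].
    """
    chunks = [[] for _ in range(len(bounds) + 1)]
    for rec in records:
        # bisect_left finds the first bound that is not below this LLSN.
        chunks[bisect.bisect_left(bounds, rec[0])].append(rec)

    ordered = []
    for chunk in chunks:
        # Each sort touches only one chunk, never the whole log.
        ordered.extend(sorted(chunk, key=lambda rec: rec[0]))
    return ordered


records = [
    (3, 1, "a"), (1, 2, "b"), (2, 1, "c"),
    (6, 2, "d"), (5, 1, "e"), (4, 2, "f"),
]
print([rec[0] for rec in replay_order(records, [3, 6])])   # [1, 2, 3, 4, 5, 6]
```

&lt;p&gt;Since no record in a later chunk can sort before a record in an earlier one, a chunk can be replayed as soon as it has been sorted, and memory use is bounded by the chunk size rather than the log size.&lt;/p&gt;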

&lt;h3 class="relative group"&gt;Questions About Log Ordering (Things I Didn&amp;rsquo;t Understand)
 &lt;div id="questions-about-log-ordering-things-i-didnt-understand" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#questions-about-log-ordering-things-i-didnt-understand" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h3&gt;
&lt;p&gt;&amp;ldquo;utilizing redo (write-ahead) logs for data recovery and undo logs for rolling back uncommitted changes&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Does PolarDB-MP have undo log files? What is this undo used for?&lt;/p&gt;
&lt;p&gt;I didn&amp;rsquo;t see anything particularly special about LLSN; the paper doesn&amp;rsquo;t detail its structure. LSN seems sufficient — maybe there are differences regarding global transaction IDs.&lt;/p&gt;

&lt;h2 class="relative group"&gt;Evaluation
 &lt;div id="evaluation" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#evaluation" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;
&lt;p&gt;Read-only operations are all local, so throughput scales linearly as nodes are added. If read-write/write-only data is well-partitioned and does not cross nodes, scaling is also nearly linear.&lt;/p&gt;
&lt;p&gt;The problem lies in data shared across nodes under read-write/write-only workloads, which is the ultimate test of distributed database performance.&lt;/p&gt;
&lt;p&gt;The paper directly compares against Huawei&amp;rsquo;s Taurus-MM. The conclusion: PolarDB-MP&amp;rsquo;s cross-node write performance is indeed significantly better.&lt;/p&gt;
&lt;p&gt;


&lt;img src="https://lastdba.com/img/csdn/eec82fe8cb39.png" alt="image-20251109212445723" /&gt;&lt;/p&gt;

&lt;h2 class="relative group"&gt;Nitpicking
 &lt;div id="nitpicking" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#nitpicking" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;
&lt;p&gt;The paper mentions Taurus-MM&amp;rsquo;s performance improvement under 8-node shared data in two places, but the data is inconsistent:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;The eight-node cluster only improves the throughput by 1.8× compared to the single-node version in the read-write workload with 50% shared data.&lt;/p&gt;
&lt;/blockquote&gt;&lt;blockquote&gt;&lt;p&gt;the throughput of Taurus-MM&amp;rsquo;s eight-node cluster is approximately 1.8× that of a single node under the SysBench write-only workload with 30% shared data, illustrating the trade-offs and challenges in optimizing multi-primary cloud databases&lt;/p&gt;
&lt;/blockquote&gt;&lt;p&gt;Sometimes 30% shared data, sometimes 50% — not very rigorous. The original &lt;a href="https://www.vldb.org/pvldb/vol16/p3488-depoutovitch.pdf" target="_blank" rel="noreferrer"&gt;Taurus MM paper&lt;/a&gt; says 50%:&lt;/p&gt;
&lt;p&gt;


&lt;img src="https://lastdba.com/img/csdn/34c0ec5e940b.png" alt="image-20251025162117902" /&gt;&lt;/p&gt;

&lt;h2 class="relative group"&gt;Summary
 &lt;div id="summary" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#summary" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;
&lt;p&gt;Not much to summarize — see the &lt;em&gt;Foreword and Abstract&lt;/em&gt; and &lt;em&gt;PMFS Summary&lt;/em&gt; sections.&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;Original link: &lt;a href="https://lastdba.com/2025/11/30/" target="_blank" rel="noreferrer"&gt;https://lastdba.com/2025/11/30/&lt;/a&gt;论文精读polar-db-mp2024-sigmod最佳工业论文/&lt;/p&gt;
&lt;/blockquote&gt;</content:encoded></item></channel></rss>