Skip to main content
  1. Posts/

How Does PG Access Basic System Tables Before pg_class Exists?

·1815 words·9 mins
liuzhilong62
Author
liuzhilong62
PostgreSQL DBA. Writing about database internals, production cases, and source code analysis.

How does the database access system tables before pg_class exists? This question can be divided into two stages:

  1. Database cluster initialization — at this point no database exists at all, so how to construct and access system tables like pg_class is a problem.
  2. Private memory initialization of system tables. PG stores system table information in the local backend process. How does the backend load pg_class during initialization?

Initializing the Data Dictionary
#

When the database hasn’t been initialized yet, it’s obviously impossible to access the data dictionary to initialize objects like database, pg_class, etc., because without a database you can’t CREATE DATABASE, and without pg_class you can’t look up metadata information.

PG uses a special language in BKI files to initialize some data structures, then initializes a primitive database in bootstrap mode1.

Compilation Phase: genbki.h & genbki.pl
#

src/include/catalog/genbki.h:

 * genbki.h defines CATALOG(), BKI_BOOTSTRAP and related macros
 * so that the catalog header files can be read by the C compiler.
 * (These same words are recognized by genbki.pl to build the BKI
 * bootstrap file from these header files.)

genbki.h is quite minimal — mainly macro definitions for catalog-related operations, as well as macros for the BKI bootstrap file. Data dictionary header files all include genbki.h.

genbki.pl reads the .h table definition files from /src/include/catalog during compilation (excluding pg_*_d.h), and creates the postgres.bki file and pg_*_d.h header files.

Taking pg_class as an example:

[postgres@catalog]$ ll |grep pg_class
-rw-r----- 1 postgres postgres   3682 Aug  6  2019 pg_class.dat
lrwxrwxrwx 1 postgres postgres     86 Apr  8 20:31 pg_class_d.h -> /lzl/soft/postgresql-11.5/src/backend/catalog/pg_class_d.h
-rw-r----- 1 postgres postgres   5219 Aug  6  2019 pg_class.h

The pg_*_d.h header files are generated by genbki.pl. All pg_*_d.h files contain the following line:

It has been GENERATED by src/backend/catalog/genbki.pl

Each data dictionary has a struct typedef struct FormData_*catalogname* for storing the row data of the data dictionary2, for example pg_class’s FormData_pg_class:

CATALOG(pg_class,1259,RelationRelationId) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83,RelationRelation_Rowtype_Id) BKI_SCHEMA_MACRO
{
	/* oid */
	Oid			oid;

	/* class name */
	NameData	relname;

	/* OID of namespace containing this class */
	Oid			relnamespace BKI_DEFAULT(pg_catalog) BKI_LOOKUP(pg_namespace);

	/* OID of entry in pg_type for relation's implicit row type, if any */
	Oid			reltype BKI_LOOKUP_OPT(pg_type);

	/* OID of entry in pg_type for underlying composite type, if any */
	Oid			reloftype BKI_DEFAULT(0) BKI_LOOKUP_OPT(pg_type);

	/* class owner */
	Oid			relowner BKI_DEFAULT(POSTGRES) BKI_LOOKUP(pg_authid);

...

	/* access-method-specific options */
	text		reloptions[1] BKI_DEFAULT(_null_);

	/* partition bound node tree */
	pg_node_tree relpartbound BKI_DEFAULT(_null_);
#endif
} FormData_pg_class;

pg_class’s OID is hardcoded as 1259, and all fields are in the FormData_pg_class struct.

After initializing the struct for user data storage, the corresponding .dat file is used to insert base data. pg_class inserts 4 rows of data, which can be understood as bootstrap items (49 data dictionary tables in PG15):

{ oid => '1247',
  relname => 'pg_type', reltype => 'pg_type' },
{ oid => '1249',
  relname => 'pg_attribute', reltype => 'pg_attribute' },
{ oid => '1255',
  relname => 'pg_proc', reltype => 'pg_proc' },
{ oid => '1259',
  relname => 'pg_class', reltype => 'pg_class' },
postgres=#  select oid,relname from  pg_class where oid::int >=1247 and oid::int<=1259;
 oid  |   relname
------+--------------
 1247 | pg_type
 1249 | pg_attribute
 1255 | pg_proc
 1259 | pg_class

Once the base data dictionary is written, everything else can be generated from it.

Database Initialization Phase: initdb & postgres.bki
#

Comment from initdb.c:

 * To create template1, we run the postgres (backend) program in bootstrap
 * mode and feed it data from the postgres.bki library file.  After this
 * initial bootstrap phase, some additional stuff is created by normal
 * SQL commands fed to a standalone backend.

The backend is launched in bootstrap mode and runs the postgres.bki script. postgres.bki can execute relevant functions without any system tables. Only after this can normal SQL files and standard backend processes be used.

template1 can be called the bootstrap database. The postgres and template0 databases are created only after template1 is established:

void
initialize_data_directory(void)
{
...
	/* Bootstrap template1 */
	bootstrap_template1();
...
	make_template0(cmdfd);

	make_postgres(cmdfd);

	PG_CMD_CLOSE;

	check_ok();
}

Once template1 exists, make_template0 and make_postgres create the corresponding template0 and postgres databases, using the normal SQL CREATE DATABASE command:

/*
 * copy template1 to postgres
 */
static void
make_postgres(FILE *cmdfd)
{
	const char *const *line;

	/*
	 * Just as we did for template0, and for the same reasons, assign a fixed
	 * OID to postgres and select the file_copy strategy.
	 */
	static const char *const postgres_setup[] = {
		"CREATE DATABASE postgres OID = " CppAsString2(PostgresDbOid)
		" STRATEGY = file_copy;\n\n",
		"COMMENT ON DATABASE postgres IS 'default administrative connection database';\n\n",
		NULL
	};

	for (line = postgres_setup; *line; line++)
		PG_CMD_PUTS(*line);
}

Backend Local Cache of Data Dictionary
#

For PG private memory basics, refer to PostgreSQL Memory Analysis3.

PG’s data dictionary information is stored in the local backend process, not shared. The data dictionary cache mainly focuses on syscache/catcache and relcache, which cache system table and table schema information respectively.

syscache/catcache is used to cache system tables, with syscache acting as the upper layer of catcache. syscache is an array where each element corresponds to a catcache, and each catcache corresponds to a system table1.

//PG15.3 SysCacheSize=35
static CatCache *SysCache[SysCacheSize];

When PG forks a backend, it calls InitPostgres, which calls the initialization functions for syscache/catcache and relcache. Let’s look at backend initialization.

syscache/catcache Initialization
#

struct cachedesc
{
	Oid			reloid;			/* OID of the relation being cached */
	Oid			indoid;			/* OID of index relation for this cache */
	int			nkeys;			/* # of keys needed for cache lookup */
	int			key[4];			/* attribute numbers of key attrs */
	int			nbuckets;		/* number of hash buckets for this cache */
};

static const struct cachedesc cacheinfo[] = {
	{
...
	{RelationRelationId,		/* RELNAMENSP */
		ClassNameNspIndexId,
		2,
		{
			Anum_pg_class_relname,
			Anum_pg_class_relnamespace,
			0,
			0
		},
		128
	},
	{RelationRelationId,		/* RELOID */
		ClassOidIndexId,
		1,
		{
			Anum_pg_class_oid,
			0,
			0,
			0
		},
		128
...
};

For example, Anum_pg_class_oid is defined in pg_class_d.h generated by genbki.pl:

#define Anum_pg_class_oid 1

reloid is the OID:

  select oid,relname from  pg_class where oid::int >=1247 and oid::int<=1259;
 oid  |   relname
------+--------------
 1259 | pg_class

InitCatalogCache actually initializes the syscache array, i.e., initializes all catcaches. InitCatalogCache eventually fully initializes CatCache through InitCatCache (one of which is for pg_class):

void
InitCatalogCache(void)
{
...
	for (cacheId = 0; cacheId < SysCacheSize; cacheId++)
	{
		SysCache[cacheId] = InitCatCache(cacheId,
										 cacheinfo[cacheId].reloid,
										 cacheinfo[cacheId].indoid,
										 cacheinfo[cacheId].nkeys,
										 cacheinfo[cacheId].key,
										 cacheinfo[cacheId].nbuckets);
		if (!PointerIsValid(SysCache[cacheId]))
			elog(ERROR, "could not initialize cache %u (%d)",
				 cacheinfo[cacheId].reloid, cacheId);
		/* Accumulate data for OID lists, too */
		SysCacheRelationOid[SysCacheRelationOidSize++] =
			cacheinfo[cacheId].reloid;
		SysCacheSupportingRelOid[SysCacheSupportingRelOidSize++] =
			cacheinfo[cacheId].reloid;
		SysCacheSupportingRelOid[SysCacheSupportingRelOidSize++] =
			cacheinfo[cacheId].indoid;
		/* see comments for RelationInvalidatesSnapshotsOnly */
		Assert(!RelationInvalidatesSnapshotsOnly(cacheinfo[cacheId].reloid));
	}
...
	CacheInitialized = true;
}

Then we come to catcache.c.

InitCatCache allocates memory and manages it in CacheMemoryContext. It only assigns some macro-defined OIDs to the corresponding catcache — at this point, tables are not yet opened:

/*
 *		InitCatCache
 *
 *	This allocates and initializes a cache for a system catalog relation.
 *	Actually, the cache is only partially initialized to avoid opening the
 *	relation.  The relation will be opened and the rest of the cache
 *	structure initialized on the first access.
 */
CatCache *
InitCatCache(int id,
			 Oid reloid,
			 Oid indexoid,
			 int nkeys,
			 const int *key,
			 int nbuckets)
{
...

	oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
...
	sz = sizeof(CatCache) + PG_CACHE_LINE_SIZE;
	cp = (CatCache *) CACHELINEALIGN(palloc0(sz));
	cp->cc_bucket = palloc0(nbuckets * sizeof(dlist_head));

	/*
	 * initialize the cache's relation information for the relation
	 * corresponding to this cache, and initialize some of the new cache's
	 * other internal fields.  But don't open the relation yet.
	 */
	cp->id = id;
	cp->cc_relname = "(not known yet)";
	cp->cc_reloid = reloid;
	cp->cc_indexoid = indexoid;
	cp->cc_relisshared = false; /* temporary */
	cp->cc_tupdesc = (TupleDesc) NULL;
	cp->cc_ntup = 0;
	cp->cc_nbuckets = nbuckets;
	cp->cc_nkeys = nkeys;
	for (i = 0; i < nkeys; ++i)
		cp->cc_keyno[i] = key[i];
...
	MemoryContextSwitchTo(oldcxt);

	return cp;
}

id is the index of the catcache array element. The assigned reloid is the known OID from cacheinfo, and key[4] from cacheinfo is also assigned. Other information is mostly unknown yet — for example, relname, tupdesc — because system tables haven’t been opened yet.

catcache only opens tables during search operations. Although the function name contains *init*, it’s no longer in the initialization process — the relevant functions won’t be shown here.

After syscache/catcache initialization completes, there is actually no tuple information at all.

relcache Initialization
#

The relcache initialization is well explained in PostgreSQL Memory Analysis.

relcache initialization has 5 phases:

  • RelationCacheInitialize - initializes relcache, initially empty
  • RelationCacheInitializePhase2 - initializes shared catalogs and loads 5 global system tables
  • RelationCacheInitializePhase3 - completes relcache initialization and loads 4 basic system tables
  • RelationIdGetRelation - gets rel description by relation id
  • RelationClose - closes a relation

Both RelationCacheInitializePhase2 and RelationCacheInitializePhase3 load system tables, and they must be in order.

RelationCacheInitializePhase2 loads several system tables — interested readers can check the function themselves. RelationCacheInitializePhase3 is the one relevant to our question, let’s look at that:

/*
 *		RelationCacheInitializePhase3
 *
 *		This is called as soon as the catcache and transaction system
 *		are functional and we have determined MyDatabaseId.  At this point
 *		we can actually read data from the database's system catalogs.
 *		We first try to read pre-computed relcache entries from the local
 *		relcache init file.  If that's missing or broken, make phony entries
 *		for the minimum set of nailed-in-cache relations.  Then (unless
 *		bootstrapping) make sure we have entries for the critical system
 *		indexes.  Once we've done all this, we have enough infrastructure to
 *		open any system catalog or use any catcache.  The last step is to
 *		rewrite the cache files if needed.
 */
void
RelationCacheInitializePhase3(void)
{
...
	if (IsBootstrapProcessingMode() ||
		!load_relcache_init_file(false))
	{
		needNewCacheFile = true;

		formrdesc("pg_class", RelationRelation_Rowtype_Id, false,
				  Natts_pg_class, Desc_pg_class);
		formrdesc("pg_attribute", AttributeRelation_Rowtype_Id, false,
				  Natts_pg_attribute, Desc_pg_attribute);
		formrdesc("pg_proc", ProcedureRelation_Rowtype_Id, false,
				  Natts_pg_proc, Desc_pg_proc);
		formrdesc("pg_type", TypeRelation_Rowtype_Id, false,
				  Natts_pg_type, Desc_pg_type);

#define NUM_CRITICAL_LOCAL_RELS 4	/* fix if you change list above */
	}

	MemoryContextSwitchTo(oldcxt);

	/* In bootstrap mode, the faked-up formrdesc info is all we'll have */
	if (IsBootstrapProcessingMode())
		return;

...
		/* now write the files */
		write_relcache_init_file(true);
		write_relcache_init_file(false);
	}
}

IsBootstrapProcessingMode is specifically designed for bootstrap mode — normal backends don’t satisfy this condition.

load_relcache_init_file(false) attempts to load system table information from the init file. load_relcache_init_file(false) passes false meaning it’s a private init file, not a shared one:

[postgres@16384]$ pwd
/pgdata/lzl/data15_6879/base/16384
-- Rough view. strings ignores some info, but table and column names are visible
[postgres@16384]$ strings pg_internal.init |grep pg_class
pg_class_oid_index
pg_class
pg_class_relname_nsp_index
[postgres@16384]$ strings pg_internal.init |grep -E "pg_class|relname"
pg_class_oid_index
pg_class
relname
relnamespace
pg_class_relname_nsp_index
relname
relnamespace

If the init file is damaged or doesn’t exist, loading the init file fails and enters the branch to load 4 basic system tables:

	// Similar to phase 2, load more system table descriptions
	if (IsBootstrapProcessingMode() ||
		!load_relcache_init_file(false))
	{
		needNewCacheFile = true;

		formrdesc("pg_class", RelationRelation_Rowtype_Id, false,
				  Natts_pg_class, Desc_pg_class);
		formrdesc("pg_attribute", AttributeRelation_Rowtype_Id, false,
				  Natts_pg_attribute, Desc_pg_attribute);
		formrdesc("pg_proc", ProcedureRelation_Rowtype_Id, false,
				  Natts_pg_proc, Desc_pg_proc);
		formrdesc("pg_type", TypeRelation_Rowtype_Id, false,
				  Natts_pg_type, Desc_pg_type);

With the 4 basic tables including pg_class, loading subsequent system table information becomes straightforward.

References
#

Related

A Brief Analysis of PostgreSQL TRUNCATE

·2240 words·11 mins
Command Options # TRUNCATE [ TABLE ] [ ONLY ] name [ * ] [, ... ] [ RESTART IDENTITY | CONTINUE IDENTITY ] [ CASCADE | RESTRICT ] 1. ONLY: truncate only the specified table. When a table has inheritance children or child partitions, by default they are truncated together; ONLY can truncate just the inheritance parent table. Partitioned parent tables cannot specify ONLY. -- Cannot truncate only a partitioned parent table => truncate only parttable; ERROR: 42809: cannot truncate only a partitioned table HINT: Do not specify the ONLY keyword, or use TRUNCATE ONLY on the partitions directly. LOCATION: ExecuteTruncate, tablecmds.c:1655 -- truncate only the inheritance parent table, only the parent is cleaned => truncate table only parenttable; TRUNCATE TABLE => select tableoid::regclass,count(*) from parenttable group by tableoid::regclass ; tableoid | count ------------+------- childtable | 1 -- Directly truncate the inheritance parent table, child tables are also cleaned => truncate table parenttable; TRUNCATE TABLE => select tableoid::regclass,count(*) from parenttable group by tableoid::regclass ; tableoid | count ----------+------- (0 rows) 2. RESTART IDENTITY CONTINUE IDENTITY: whether to reset sequences on columns. Default is CONTINUE.

From Extremely Slow Unique Index Scan to Index Bloat

·6561 words·31 mins
How Did a Primary Key Query Access Multiple Data Pages? # Continuing from the previous article: A Classic Case of Long Transactions, Table Bloat, and LIMIT Problems, there was one point not explained in detail: Why does a query using the primary key generate so many shared hits? Why does index bloat cause access to multiple data pages? Can’t data outside the page be located through the corresponding index entry? This relates to index version management — indexes do carry some version information, but not much. Let’s first review PostgreSQL’s btree index structure.

When Does VACUUM Truncate Empty Pages at the End of a Table?

·1534 words·8 mins
VACUUM Truncate # TRUNCATE—Specifies that VACUUM should attempt to truncate off any empty pages at the end of the table and allow the disk space for the truncated pages to be returned to the operating system. This is normally the desired behavior and is the default unless the vacuum_truncate option has been set to false for the table to be vacuumed. Setting this option to false may be useful to avoid ACCESS EXCLUSIVE lock on the table that the truncation requires. This option is ignored if the FULL option is used.