Case: From Inaccurate DISTINCT to the Principles of DISTINCT Estimation

Table of Contents

Problem Description
#

The n_distinct statistic was severely inaccurate.

This problem appeared across multiple databases. For example:

A table with 200 million rows and a true DISTINCT count of 8 million had a statistics DISTINCT value of only 40,000.

Analysis
#

Sampling Model
#

Does the standby have its own statistics? · PostgreSQL Apprentice

The default default_statistics_target=100 means 30,000 rows are sampled from 30,000 pages.

analyze verbose tablzl1;
INFO:  00000: analyzing "public.tablzl1"
LOCATION:  do_analyze_rel, analyze.c:332
INFO:  00000: "tablzl1": scanned 30000 of 22963751 pages, containing 1061942 live rows and 3953 dead rows; 30000 rows in sample, 812872389 estimated total rows
LOCATION:  acquire_sample_rows, analyze.c:1340

Note “scanned 30000” and “30000 rows in sample”.

DISTINCT Estimation Algorithm
#

The DISTINCT estimation algorithm in analyze.c:

			/*----------
			 * Estimate the number of distinct values using the estimator
			 * proposed by Haas and Stokes in IBM Research Report RJ 10025:
			 *		n*d / (n - f1 + f1*n/N)
			 * where f1 is the number of distinct values that occurred
			 * exactly once in our sample of n rows (from a total of N),
			 * and d is the total number of distinct values in the sample.
			 * This is their Duj1 estimator; the other estimators they
			 * recommend are considerably more complex, and are numerically
			 * very unstable when n is much smaller than N.
			 *
			 * In this calculation, we consider only non-nulls.  We used to
			 * include rows with null values in the n and N counts, but that
			 * leads to inaccurate answers in columns with many nulls, and
			 * it's intuitively bogus anyway considering the desired result is
			 * the number of distinct non-null values.
			 *
			 * We assume (not very reliably!) that all the multiply-occurring
			 * values are reflected in the final track[] list, and the other
			 * nonnull values all appeared but once.  (XXX this usually
			 * results in a drastic overestimate of ndistinct.  Can we do
			 * any better?)
			 *----------
			 */
			int			f1 = nonnull_cnt - summultiple;
			int			d = f1 + nmultiple;
			double		n = samplerows - null_cnt;
			double		N = totalrows * (1.0 - stats->stanullfrac);
			double		stadistinct;

n*d / (n - f1 + f1*n/N)

n = number of sample rows (rows scanned)
d = number of distinct values found in the sample
f1 = number of values appearing exactly once in the sample
N = total number of rows in the table

Algorithm paper: https://hugepdf.com/download/download-extended-version-of-this-paper_pdf

The paper is rather dense, so let’s work through some assumptions to understand this DISTINCT algorithm:

Assume all values appear exactly once, and the table is large (n « N), so f1 = d, n/N ≈ 0

d*d / (d - d + d*0) = d²/0 — this would evaluate to -1.

Assume all values appear exactly once, and the table is small (n = N), so f1 = d, n/N = 1

n*d / (n - d + d*1) = d — the sampled distinct count, which equals the number of sampled rows.

Assume no values appear exactly once in the sample, i.e., f1 = 0

n*d / (n - f1 + f1*n/N) = n*d / n = d — just the distinct count in the sample.

If a column is populated by inserting several rows of the same value, then several rows of another value, like:

11, 2, 2, 2, 2, 3, 3, 3, …

3.1 Small table, all 30,000 rows sampled, true distinct = 10,000 (assumed): estimated distinct = d = 10,000

3.2 Large table, sample contains both repeating values and singletons (some repeating values only have one row captured), i.e., n = 30,000, n/N ≈ 0

n*d / (n - f1 + f1*n/N) = n*d / (n - f1) = 30000*d/(30000-f1) — the larger the distinct count in the sample, the larger the estimated distinct; the larger the number of singletons, the larger the estimated distinct.

Summary:

DISTINCT estimation is directly related to the distinct count and singleton count in the sample
If the singleton count = 0, then larger samples yield larger estimated distinct values

Verification
#

Since the default maximum sample size is 30,000 rows, for tables larger than this, the estimator is likely to underestimate DISTINCT. Note: the data should not have too many unique values.

Testing a table with different sample sizes:

Table: reltuples = 800 million, relpages = 20 million, size = 175GB, true column distinct = 100 million

target statistics	pages sampling ratio (approx)	tuples sampling ratio (approx)	n_distinct	execution time
50	0.00075	0.00001875	60k	2s
100	0.0015	0.0000375	110k	5s
1000	0.015	0.000375	1.03M	58s
3000	0.045	0.001125	2.68M	3min 1s
10000	0.15	0.00375	6.75M	7min 21s

(maximum target statistics is 10000)

A rough conclusion: n_distinct and ANALYZE execution time grow proportionally with the sample size.

n_distinct grows with sample size, while pages and tuples estimates remain consistently accurate.

Solution
#

For extremely large tables, consider partitioning or optimizing based on actual SQL patterns.

You can also adjust the statistics target. The default default_statistics_target=100 means 30,000 rows from 30,000 pages.

Temporary fix:

set default_statistics_target=3000;
analyze tab1;

Long-term fix:

alter table tab1 alter column col1 set STATISTICS 3000;

Notes:

Column-level statistics target has the highest priority, overriding default_statistics_target
Maximum statistics target is 10000
The table’s sampling target is determined by the maximum column target:

	/*
	 * Determine how many rows we need to sample, using the worst case from
	 * all analyzable columns.  We use a lower bound of 100 rows to avoid
	 * possible overflow in Vitter's algorithm.  (Note: that will also be the
	 * target in the corner case where there are no analyzable columns.)
	 */
	targrows = 100;
	for (i = 0; i < attr_cnt; i++)
	{
		if (targrows < vacattrstats[i]->minrows)
			targrows = vacattrstats[i]->minrows;
	}
	for (ind = 0; ind < nindexes; ind++)
	{
		AnlIndexData *thisdata = &indexdata[ind];

		for (i = 0; i < thisdata->attr_cnt; i++)
		{
			if (targrows < thisdata->vacattrstats[i]->minrows)
				targrows = thisdata->vacattrstats[i]->minrows;
		}
	}

If ANALYZE collects more or fewer rows than expected, check pg_statistic for per-column stattarget settings:

select attrelid::regclass,attname,attstattarget from pg_attribute where attrelid = 'tab1'::regclass and attstattarget not in (-1,0);

Summary
#

For large tables where columns are non-unique but have high distinct counts (a realistic scenario), the sampling algorithm underestimates the DISTINCT value, and this is positively correlated with the sampling ratio. The default sampling ratio is too small for large tables. You can increase it, but even the maximum is not that large.

Case: GRANT Authorization Causes Walsender to Hang

Jun 26, 2025·1613 words·8 mins

PostgreSQL案例

Symptoms # The walsender’s LSN stopped advancing. The stack trace showed it was stuck in pathman’s invalidate_psin_entries_using_relid, with the relid constantly changing and the walsender CPU pegged at 100%. pstack 121327 #0 hash_seq_search (status=status@entry=0x7fffaadf8330) at dynahash.c:1441 #1 0x00002ba3b40ec728 in invalidate_psin_entries_using_relid (relid=relid@entry=42319501) at src/relation_info.c:251 #2 0x00002ba3b40ecb3d in forget_status_of_relation (relid=relid@entry=42319501) at src/relation_info.c:232 #3 0x00002ba3b40fcc96 in pathman_relcache_hook (arg=<optimized out>, relid=42319501) at src/hooks.c:934 #4 0x000000000087168a in LocalExecuteInvalidationMessage (msg=0x3a391c8) at inval.c:595 #5 0x000000000071d50e in ReorderBufferExecuteInvalidations (rb=0x1b63ff8, txn=0x1be5f58, txn=0x1be5f58) at reorderbuffer.c:2238 #6 ReorderBufferCommit (rb=0x1b63ff8, xid=xid@entry=4285897514, commit_lsn=405674661986920, end_lsn=<optimized out>, commit_time=commit_time@entry=799377897828299, origin_id=origin_id@entry=0, origin_lsn=origin_lsn@entry=0) at reorderbuffer.c:1819 #7 0x0000000000712d18 in DecodeCommit (xid=4285897514, parsed=0x7fffaadf8630, buf=0x7fffaadf87f0, ctx=0x1a359e8) at decode.c:637 #8 DecodeXactOp (ctx=0x1a359e8, buf=buf@entry=0x7fffaadf87f0) at decode.c:245 #9 0x00000000007130b2 in LogicalDecodingProcessRecord (ctx=0x1a359e8, record=0x1a35c80) at decode.c:114 #10 0x0000000000733662 in XLogSendLogical () at walsender.c:2885 #11 0x0000000000735942 in WalSndLoop (send_data=send_data@entry=0x733620 <XLogSendLogical>) at walsender.c:2287 #12 0x0000000000736692 in StartLogicalReplication (cmd=0x1846c68) at walsender.c:1213 #13 exec_replication_command (cmd_string=cmd_string@entry=0x181a288 "START_REPLICATION SLOT \"lzl_logical_rep\" LOGICAL 170F5/7C3EAE78 (\"proto_version\" '1', \"publication_names\" 'lzl_logical_rep')") at walsender.c:1640 #14 0x0000000000774e91 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x1866478, dbname=0x18662b8 "lzldb", username=<optimized out>) at postgres.c:4325 #15 0x0000000000485989 in BackendRun (port=<optimized out>, port=<optimized out>) at postmaster.c:4526 #16 BackendStartup (port=0x18635b0) at postmaster.c:4210 #17 ServerLoop () at postmaster.c:1739 #18 0x0000000000702f08 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x1814da0) at postmaster.c:1412 #19 0x000000000048660a in main (argc=3, argv=0x1814da0) at main.c:210 ## Second execution, same stack, different relid pstack 121327 #0 hash_seq_search (status=status@entry=0x7fffaadf8330) at dynahash.c:1441 #1 0x00002ba3b40ec728 in invalidate_psin_entries_using_relid (relid=relid@entry=26560221) at src/relation_info.c:251 #2 0x00002ba3b40ecb3d in forget_status_of_relation (relid=relid@entry=26560221) at src/relation_info.c:232 #3 0x00002ba3b40fcc96 in pathman_relcache_hook (arg=<optimized out>, relid=26560221) at src/hooks.c:934 #4 0x000000000087168a in LocalExecuteInvalidationMessage (msg=0x39f1f68) at inval.c:595 ... Analysis # The changing relid showed that the walsender was still running, not dead. The LSN was not advancing, so we analyzed the LSN position to see what the transaction was doing.

Problem Description#

Analysis#

Sampling Model#

DISTINCT Estimation Algorithm#

Verification#

Solution#

Summary#

Related