Skip to main content
  1. Posts/

Case: From Inaccurate DISTINCT to the Principles of DISTINCT Estimation

·1049 words·5 mins
liuzhilong62
Author
liuzhilong62
PostgreSQL DBA. Writing about database internals, production cases, and source code analysis.

Problem Description
#

The n_distinct statistic was severely inaccurate.

This problem appeared across multiple databases. For example:

A table with 200 million rows and a true DISTINCT count of 8 million had a statistics DISTINCT value of only 40,000.

Analysis
#

Sampling Model
#

Does the standby have its own statistics? · PostgreSQL Apprentice

The default default_statistics_target=100 means 30,000 rows are sampled from 30,000 pages.

analyze verbose tablzl1;
INFO:  00000: analyzing "public.tablzl1"
LOCATION:  do_analyze_rel, analyze.c:332
INFO:  00000: "tablzl1": scanned 30000 of 22963751 pages, containing 1061942 live rows and 3953 dead rows; 30000 rows in sample, 812872389 estimated total rows
LOCATION:  acquire_sample_rows, analyze.c:1340

Note “scanned 30000” and “30000 rows in sample”.

DISTINCT Estimation Algorithm
#

The DISTINCT estimation algorithm in analyze.c:

			/*----------
			 * Estimate the number of distinct values using the estimator
			 * proposed by Haas and Stokes in IBM Research Report RJ 10025:
			 *		n*d / (n - f1 + f1*n/N)
			 * where f1 is the number of distinct values that occurred
			 * exactly once in our sample of n rows (from a total of N),
			 * and d is the total number of distinct values in the sample.
			 * This is their Duj1 estimator; the other estimators they
			 * recommend are considerably more complex, and are numerically
			 * very unstable when n is much smaller than N.
			 *
			 * In this calculation, we consider only non-nulls.  We used to
			 * include rows with null values in the n and N counts, but that
			 * leads to inaccurate answers in columns with many nulls, and
			 * it's intuitively bogus anyway considering the desired result is
			 * the number of distinct non-null values.
			 *
			 * We assume (not very reliably!) that all the multiply-occurring
			 * values are reflected in the final track[] list, and the other
			 * nonnull values all appeared but once.  (XXX this usually
			 * results in a drastic overestimate of ndistinct.  Can we do
			 * any better?)
			 *----------
			 */
			int			f1 = nonnull_cnt - summultiple;
			int			d = f1 + nmultiple;
			double		n = samplerows - null_cnt;
			double		N = totalrows * (1.0 - stats->stanullfrac);
			double		stadistinct;

n*d / (n - f1 + f1*n/N)

  • n = number of sample rows (rows scanned)
  • d = number of distinct values found in the sample
  • f1 = number of values appearing exactly once in the sample
  • N = total number of rows in the table

Algorithm paper: https://hugepdf.com/download/download-extended-version-of-this-paper_pdf

The paper is rather dense, so let’s work through some assumptions to understand this DISTINCT algorithm:

  1. Assume all values appear exactly once, and the table is large (n « N), so f1 = d, n/N ≈ 0

d*d / (d - d + d*0) = d²/0 — this would evaluate to -1.

  1. Assume all values appear exactly once, and the table is small (n = N), so f1 = d, n/N = 1

n*d / (n - d + d*1) = d — the sampled distinct count, which equals the number of sampled rows.

  1. Assume no values appear exactly once in the sample, i.e., f1 = 0

n*d / (n - f1 + f1*n/N) = n*d / n = d — just the distinct count in the sample.

If a column is populated by inserting several rows of the same value, then several rows of another value, like:

11, 2, 2, 2, 2, 3, 3, 3, …

3.1 Small table, all 30,000 rows sampled, true distinct = 10,000 (assumed): estimated distinct = d = 10,000

3.2 Large table, sample contains both repeating values and singletons (some repeating values only have one row captured), i.e., n = 30,000, n/N ≈ 0

n*d / (n - f1 + f1*n/N) = n*d / (n - f1) = 30000*d/(30000-f1) — the larger the distinct count in the sample, the larger the estimated distinct; the larger the number of singletons, the larger the estimated distinct.

Summary:

  • DISTINCT estimation is directly related to the distinct count and singleton count in the sample
  • If the singleton count = 0, then larger samples yield larger estimated distinct values

Verification
#

Since the default maximum sample size is 30,000 rows, for tables larger than this, the estimator is likely to underestimate DISTINCT. Note: the data should not have too many unique values.

Testing a table with different sample sizes:

Table: reltuples = 800 million, relpages = 20 million, size = 175GB, true column distinct = 100 million

target statisticspages sampling ratio (approx)tuples sampling ratio (approx)n_distinctexecution time
500.000750.0000187560k2s
1000.00150.0000375110k5s
10000.0150.0003751.03M58s
30000.0450.0011252.68M3min 1s
100000.150.003756.75M7min 21s

(maximum target statistics is 10000)

A rough conclusion: n_distinct and ANALYZE execution time grow proportionally with the sample size.

n_distinct grows with sample size, while pages and tuples estimates remain consistently accurate.

Solution
#

For extremely large tables, consider partitioning or optimizing based on actual SQL patterns.

You can also adjust the statistics target. The default default_statistics_target=100 means 30,000 rows from 30,000 pages.

Temporary fix:

set default_statistics_target=3000;
analyze tab1;

Long-term fix:

alter table tab1 alter column col1 set STATISTICS 3000;

Notes:

  • Column-level statistics target has the highest priority, overriding default_statistics_target
  • Maximum statistics target is 10000
  • The table’s sampling target is determined by the maximum column target:
	/*
	 * Determine how many rows we need to sample, using the worst case from
	 * all analyzable columns.  We use a lower bound of 100 rows to avoid
	 * possible overflow in Vitter's algorithm.  (Note: that will also be the
	 * target in the corner case where there are no analyzable columns.)
	 */
	targrows = 100;
	for (i = 0; i < attr_cnt; i++)
	{
		if (targrows < vacattrstats[i]->minrows)
			targrows = vacattrstats[i]->minrows;
	}
	for (ind = 0; ind < nindexes; ind++)
	{
		AnlIndexData *thisdata = &indexdata[ind];

		for (i = 0; i < thisdata->attr_cnt; i++)
		{
			if (targrows < thisdata->vacattrstats[i]->minrows)
				targrows = thisdata->vacattrstats[i]->minrows;
		}
	}

If ANALYZE collects more or fewer rows than expected, check pg_statistic for per-column stattarget settings:

select attrelid::regclass,attname,attstattarget from pg_attribute where attrelid = 'tab1'::regclass and attstattarget not in (-1,0);

Summary
#

For large tables where columns are non-unique but have high distinct counts (a realistic scenario), the sampling algorithm underestimates the DISTINCT value, and this is positively correlated with the sampling ratio. The default sampling ratio is too small for large tables. You can increase it, but even the maximum is not that large.

Related

Case Study: Performance Degradation After Adding an Index and the Generic Plan

·2239 words·11 mins
Problem Description # An index was added the night before, and the next morning the CPU was maxed out. The problematic SQL was easy to locate — just one query. The SQL was running for over 30 seconds, but the day before it only took about 3 seconds, so we needed to examine the before-and-after execution plan changes. Only the key parts of the execution plan are shown below.

Case: GRANT Authorization Causes Walsender to Hang

·1613 words·8 mins
Symptoms # The walsender’s LSN stopped advancing. The stack trace showed it was stuck in pathman’s invalidate_psin_entries_using_relid, with the relid constantly changing and the walsender CPU pegged at 100%. pstack 121327 #0 hash_seq_search (status=status@entry=0x7fffaadf8330) at dynahash.c:1441 #1 0x00002ba3b40ec728 in invalidate_psin_entries_using_relid (relid=relid@entry=42319501) at src/relation_info.c:251 #2 0x00002ba3b40ecb3d in forget_status_of_relation (relid=relid@entry=42319501) at src/relation_info.c:232 #3 0x00002ba3b40fcc96 in pathman_relcache_hook (arg=<optimized out>, relid=42319501) at src/hooks.c:934 #4 0x000000000087168a in LocalExecuteInvalidationMessage (msg=0x3a391c8) at inval.c:595 #5 0x000000000071d50e in ReorderBufferExecuteInvalidations (rb=0x1b63ff8, txn=0x1be5f58, txn=0x1be5f58) at reorderbuffer.c:2238 #6 ReorderBufferCommit (rb=0x1b63ff8, xid=xid@entry=4285897514, commit_lsn=405674661986920, end_lsn=<optimized out>, commit_time=commit_time@entry=799377897828299, origin_id=origin_id@entry=0, origin_lsn=origin_lsn@entry=0) at reorderbuffer.c:1819 #7 0x0000000000712d18 in DecodeCommit (xid=4285897514, parsed=0x7fffaadf8630, buf=0x7fffaadf87f0, ctx=0x1a359e8) at decode.c:637 #8 DecodeXactOp (ctx=0x1a359e8, buf=buf@entry=0x7fffaadf87f0) at decode.c:245 #9 0x00000000007130b2 in LogicalDecodingProcessRecord (ctx=0x1a359e8, record=0x1a35c80) at decode.c:114 #10 0x0000000000733662 in XLogSendLogical () at walsender.c:2885 #11 0x0000000000735942 in WalSndLoop (send_data=send_data@entry=0x733620 <XLogSendLogical>) at walsender.c:2287 #12 0x0000000000736692 in StartLogicalReplication (cmd=0x1846c68) at walsender.c:1213 #13 exec_replication_command (cmd_string=cmd_string@entry=0x181a288 "START_REPLICATION SLOT \"lzl_logical_rep\" LOGICAL 170F5/7C3EAE78 (\"proto_version\" '1', \"publication_names\" 'lzl_logical_rep')") at walsender.c:1640 #14 0x0000000000774e91 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x1866478, dbname=0x18662b8 "lzldb", username=<optimized out>) at postgres.c:4325 #15 0x0000000000485989 in BackendRun (port=<optimized out>, port=<optimized out>) at postmaster.c:4526 #16 BackendStartup (port=0x18635b0) at postmaster.c:4210 #17 ServerLoop () at postmaster.c:1739 #18 0x0000000000702f08 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x1814da0) at postmaster.c:1412 #19 0x000000000048660a in main (argc=3, argv=0x1814da0) at main.c:210 ## Second execution, same stack, different relid pstack 121327 #0 hash_seq_search (status=status@entry=0x7fffaadf8330) at dynahash.c:1441 #1 0x00002ba3b40ec728 in invalidate_psin_entries_using_relid (relid=relid@entry=26560221) at src/relation_info.c:251 #2 0x00002ba3b40ecb3d in forget_status_of_relation (relid=relid@entry=26560221) at src/relation_info.c:232 #3 0x00002ba3b40fcc96 in pathman_relcache_hook (arg=<optimized out>, relid=26560221) at src/hooks.c:934 #4 0x000000000087168a in LocalExecuteInvalidationMessage (msg=0x39f1f68) at inval.c:595 ... Analysis # The changing relid showed that the walsender was still running, not dead. The LSN was not advancing, so we analyzed the LSN position to see what the transaction was doing.

PostgreSQL Ops Experience 2024

·5612 words·27 mins
This article focuses on common PostgreSQL operations issues — rare edge cases that surface once every two or three years are out of scope. It’s primarily a technical ops summary, aiming for clarity and quick applicability. Deep dives at the source-code level are deliberately avoided. SQL Performance & Execution Plans # Sudden Execution Plan Changes # PostgreSQL does not support optimizer hints natively, and the community has made it clear it never will. The PG community’s stance is roughly: “Our optimizer is perfect. If the current plan isn’t good enough, it’s because the developer doesn’t understand optimization.”