UUID v4 and v7: Collision Incidents and Performance Benchmarks

Author: liuzhilong62
PostgreSQL DBA. Writing about database internals, production cases, and source code analysis.

Source material: HN UUID v4 Collision Thread, dev.to UUID Benchmark

AI-generated ratio: 99%

TL;DR

A UUID v4 collision really happened: someone on Hacker News reported one in production. The plausible causes are all bugs somewhere in the software or hardware stack, not the math. v4 and v7 have no fundamental difference in collision safety. The real difference is index performance: v7 is time-ordered, so the B-tree stays more compact, writes are 35% faster, and the index is 22% smaller. Your UUID v4 is probably fine, but if you care about index performance, switching to v7 is a cheap win.

The UUID v4 Collision Incident

A Hacker News thread blew up: "Ask HN: We just had an actual UUID v4 collision…" (479 upvotes, 347 comments).

The OP’s own words:

I know what you’re thinking… and I still can’t believe it, but… This morning, our database flagged a duplicate UUID (v4).

It wasn’t a double-insert bug. The code didn’t write it twice. Only ~15,000 rows in the table, using npm’s uuid package uuidv4(), and two rows created at different times collided on the same UUID:

b6133fd6-70fe-4fe3-bed6-8ca8fc9386cd

What’s the probability of a UUID v4 collision? A v4 UUID carries 122 random bits, so there are 2^122 ≈ 5.3×10^36 possible values. By the birthday bound, n records collide with probability of roughly n²/(2·2^122), which for 15,000 records is about 2×10^-29. Theoretically “impossible.”
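The estimate can be reproduced with a few lines of arithmetic. This is a minimal sketch using the standard birthday-bound approximation; the function name `collision_probability` is illustrative and not taken from the thread.

```python
def collision_probability(n: int, random_bits: int = 122) -> float:
    """Birthday-bound approximation: p ~= n * (n - 1) / (2 * 2**bits)."""
    return n * (n - 1) / 2 / 2 ** random_bits


# 15,000 rows, 122 random bits per UUID v4
p = collision_probability(15_000)
print(f"{p:.2e}")  # on the order of 2e-29
```

The approximation only holds while the generator's output really is uniform over all 2^122 values, which is exactly the assumption the causes below call into question.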

But it happened.

Cause 1: Unreliable entropy sources

HN’s top-voted comment (jandrewrogers):

UUIDv4 security depends on high-quality entropy sources. Hardware defects, software bugs, and misunderstandings of “high-quality entropy” all break this assumption. Detecting entropy source failures is expensive, so nobody checks — until a collision happens.

For that reason, some high-reliability systems reportedly ban UUID v4 outright: the quality of the entropy source cannot be verified.

Cause 2: Known npm uuid package bugs

The npm uuid package README itself warns:

This module may generate duplicate UUIDs when run in clients with deterministic random number generators, such as Googlebot crawlers.

More seriously, its internal rng() function has global mutable state. One commenter pointed out: calling rng() and sending the result effectively overwrites someone else’s random number, and you can predict it.

Related commit: 91805f665c

Community advice: use Node.js built-in crypto.randomUUID(), not the npm uuid package.

Cause 3: Linux kernel /dev/random race condition

Another comment:

I encountered duplicate UUIDs during soak testing of a distributed system. After extensive debugging, I found it was a Linux kernel race condition bug — on multi-processor systems, two processes simultaneously reading /dev/random could, with extremely low probability (~one in a million), get the same bytes.

Cause 4: Go UUID library not checking return values

Early Go UUID libraries called random functions without checking the return value length. “Request N bytes, got 3 bytes back” never happened on most hardware, so nobody checked — until production, where it generated thousands of duplicate UUIDs.
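The defensive pattern is language-independent: never assume a read returned as many bytes as requested. The sketch below shows it in Python with a hypothetical `TrickleSource` that hands back at most 3 bytes per call; `read_exact` is an illustrative helper, not code from any of the libraries discussed. In Go, `io.ReadFull` plays the same role.

```python
import io
from typing import Protocol


class ByteSource(Protocol):
    def read(self, n: int) -> bytes: ...


def read_exact(source: ByteSource, n: int) -> bytes:
    """Read exactly n bytes, or raise instead of returning a short buffer."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = source.read(remaining)
        if not chunk:  # the source is exhausted or has stopped producing bytes
            raise OSError(f"short read: wanted {n} bytes, got {n - remaining}")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


class TrickleSource:
    """Hypothetical byte source that returns at most 3 bytes per read call."""

    def __init__(self, data: bytes) -> None:
        self._buf = io.BytesIO(data)

    def read(self, n: int) -> bytes:
        return self._buf.read(min(n, 3))


# An unchecked call "succeeds" but silently returns a short buffer.
naive = TrickleSource(bytes(range(16))).read(16)

# The checked helper keeps reading until all 16 bytes have arrived.
full = read_exact(TrickleSource(bytes(range(16))), 16)
```

If the short buffer is copied into a zero-filled 16-byte UUID, most of the "random" bits are constant, which is how thousands of duplicates can appear at once.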

Cause 5: Historical AMD CPU RNG defects

Certain AMD CPUs had built-in random number generator issues. VM environments can also “virtualize away” entropy — both time sources and entropy sources can degrade inside VMs.


v4 and v7 have no fundamental difference in collision safety. The difference is in the first 48 bits — v4 is random, v7 is a timestamp. You’re unlikely to encounter timestamp source issues, and random source issues are equally rare. The HN thread is an interesting edge case. Knowing that a tiny number of people hit it is enough — you don’t need to distrust the UUID v4 in your own systems.

When choosing v4 vs v7, what you should really look at isn’t collisions — it’s index performance.

UUID v7 Performance Comparison in PG 16

UUID v7 has one concrete advantage over v4 in PostgreSQL: temporal clustering, which is friendlier to B-tree indexes. Both v4 and v7 indexes can bloat. The difference is simply that v7’s first 48 bits are time-ordered, so inserts concentrate on the right side of the B-tree and cause fewer page splits.

Umang Sinha’s benchmark ran a rigorous comparison on a PG 16 Docker container (8 cores, 16GB, NVMe).

Test Conditions

CREATE TABLE uuid_v4_test (id UUID PRIMARY KEY, payload TEXT);
CREATE TABLE uuid_v7_test (id UUID PRIMARY KEY, payload TEXT);

| Parameter | Value |
| --- | --- |
| Data volume | 10 million rows per table |
| Batch size | 10,000 rows per batch |
| Client | Go + pq driver |
| UUID generation | Pre-generated in memory, not timed |

Performance Results

| Metric | UUID v4 | UUID v7 | Improvement |
| --- | --- | --- | --- |
| Write 10M rows | 5 min 35 sec | 3 min 38 sec | 35% faster |
| Table + index total size | 3618 MB | 3443 MB | 5% smaller |
| B-tree index size | 776 MB | 602 MB | 22% smaller |
| Point lookup | 0.167 ms | 0.038 ms | 4.4x faster |
| Range scan | 8.283 ms | 3.791 ms | 2.2x faster |
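As a sanity check, the Improvement column can be recomputed from the raw figures. This is only arithmetic on the published numbers; the helper and variable names are illustrative.

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100


v4_write_s = 5 * 60 + 35  # 5 min 35 sec
v7_write_s = 3 * 60 + 38  # 3 min 38 sec

write_time_saved = pct_reduction(v4_write_s, v7_write_s)
throughput_ratio = v4_write_s / v7_write_s
total_saved = pct_reduction(3618, 3443)
index_saved = pct_reduction(776, 602)
lookup_speedup = 0.167 / 0.038
range_speedup = 8.283 / 3.791

print(f"write time saved:  {write_time_saved:.1f}%")  # 34.9%
print(f"write throughput:  {throughput_ratio:.2f}x")  # 1.54x
print(f"total size saved:  {total_saved:.1f}%")       # 4.8%
print(f"index size saved:  {index_saved:.1f}%")       # 22.4%
print(f"point lookup:      {lookup_speedup:.2f}x")    # 4.39x
print(f"range scan:        {range_speedup:.2f}x")     # 2.18x
```

One nuance the arithmetic makes visible: "35% faster" here means 35% less elapsed time, which corresponds to roughly 1.54x the write throughput.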

Why Such a Big Difference

[Figure: UUID v4 bit structure]

[Figure: UUID v7 bit structure]

UUID v4 is fully random. Newly inserted UUIDs scatter randomly across the B-tree index, causing massive page splits and severe index fragmentation. UUID v7 has a millisecond-precision timestamp in the first 48 bits, so newly generated UUIDs are naturally ordered — writes cluster on the right side of the B-tree, page splits drop dramatically, and the index is much more compact.
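To make the layout concrete, here is a minimal sketch that assembles a v7-style UUID by hand, following the RFC 9562 field order: a 48-bit Unix timestamp in milliseconds, 4 version bits, 12 random bits, 2 variant bits, and 62 more random bits. The function name `uuid7_sketch` is hypothetical, and the sketch does not add a monotonic counter, so values generated within the same millisecond are not ordered relative to each other.

```python
import os
import time
import uuid
from typing import Optional


def uuid7_sketch(unix_ms: Optional[int] = None) -> uuid.UUID:
    """Build a UUID with the v7 layout: 48-bit ms timestamp, then random bits."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    value = ((unix_ms & ((1 << 48) - 1)) << 80) | rand
    # Overwrite the 4 version bits with 0b0111 (version 7).
    value = (value & ~(0xF << 76)) | (0x7 << 76)
    # Overwrite the 2 variant bits with 0b10 (the RFC variant).
    value = (value & ~(0x3 << 62)) | (0x2 << 62)
    return uuid.UUID(int=value)


early = uuid7_sketch(unix_ms=1_700_000_000_000)
late = uuid7_sketch(unix_ms=1_700_000_000_001)
print(early, late, early < late)
```

Because the timestamp occupies the most significant bits, a UUID from a later millisecond always compares greater than one from an earlier millisecond, which is what sends new index entries to the right edge of the B-tree.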

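The 22% smaller index isn’t magic: it comes from reduced fragmentation, because pages that fill in key order stay densely packed. Point lookups being roughly 4x faster isn’t surprising either: a more compact index can mean fewer B-tree levels and, more often, a higher cache hit rate.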
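The fragmentation effect can be illustrated with a toy model of a B-tree leaf level. This is a deliberately simplified sketch, not PostgreSQL's actual split logic: pages hold a fixed number of keys, a split caused by appending at the far right keeps the old page full, and any other split divides the page in half. The function name and the page capacity of 100 are assumptions chosen for illustration, and the output is not expected to reproduce the benchmark's figures.

```python
import bisect
import random


def simulate_leaf_fill(keys, capacity=100):
    """Insert keys into toy leaf pages; return (pages, splits, average fill)."""
    pages = [[]]            # leaf pages in key order, each a sorted list
    lows = [float("-inf")]  # lowest key each page is responsible for
    splits = 0
    for key in keys:
        i = bisect.bisect_right(lows, key) - 1
        page = pages[i]
        bisect.insort(page, key)
        if len(page) > capacity:
            splits += 1
            rightmost_append = i == len(pages) - 1 and page[-1] == key
            # Appending at the far right: leave the old page full.
            # Anywhere else: split in half, leaving both pages half empty.
            cut = capacity if rightmost_append else len(page) // 2
            pages[i:i + 1] = [page[:cut], page[cut:]]
            lows.insert(i + 1, page[cut])
    avg_fill = sum(len(p) for p in pages) / (len(pages) * capacity)
    return len(pages), splits, avg_fill


n = 50_000
ordered_keys = range(n)                                   # stands in for v7
random_keys = random.Random(42).sample(range(10**12), n)  # stands in for v4

seq_pages, seq_splits, seq_fill = simulate_leaf_fill(ordered_keys)
rnd_pages, rnd_splits, rnd_fill = simulate_leaf_fill(random_keys)
print(f"ordered: {seq_pages} pages, {seq_splits} splits, fill {seq_fill:.0%}")
print(f"random:  {rnd_pages} pages, {rnd_splits} splits, fill {rnd_fill:.0%}")
```

In this model the ordered keys pack every page completely, while the random keys settle at a noticeably lower average fill and therefore need more pages for the same number of rows.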

Summary

UUID v4 and v7 are identical in collision safety — both depend on entropy source quality, one fills the first 48 bits with random numbers, the other with a timestamp. Collisions are edge cases that a tiny number of people hit in specific environments. Your environment is probably fine — that basic judgment doesn’t change.

What you really should think about is index performance. v7’s temporal property makes B-trees more compact, with measured results of 35% faster writes, 22% smaller indexes, and 2-4x faster queries. If your system writes UUIDs at high volume, switching to v7 saves meaningful storage and CPU.

PostgreSQL 18 supports UUID v7 natively through the uuidv7() function. On earlier versions, generate v7 UUIDs at the application layer. Whichever version you use, always add a UNIQUE constraint.
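The value of the constraint is that a duplicate becomes a loud error instead of silent data corruption. The sketch below illustrates this with Python's built-in `sqlite3` module standing in for PostgreSQL (an assumption made so the example is self-contained); the `items` table, `insert_with_retry`, and `flaky_id` are hypothetical names, and retrying with a fresh ID is one possible reaction, not something the source prescribes.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
# PRIMARY KEY implies uniqueness, like the benchmark tables above.
conn.execute("CREATE TABLE items (id TEXT PRIMARY KEY, payload TEXT)")


def insert_with_retry(conn, payload, new_id=uuid.uuid4, attempts=3):
    """Insert a row, generating a fresh id if the constraint rejects one."""
    for _ in range(attempts):
        candidate = str(new_id())
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO items (id, payload) VALUES (?, ?)",
                    (candidate, payload),
                )
            return candidate
        except sqlite3.IntegrityError:
            continue  # duplicate id: try again with a new one
    raise RuntimeError("could not generate a unique id")


# Hypothetical faulty generator: returns the same UUID twice, then recovers.
stuck = iter([uuid.UUID("b6133fd6-70fe-4fe3-bed6-8ca8fc9386cd")] * 2)


def flaky_id():
    return next(stuck, None) or uuid.uuid4()


first = insert_with_retry(conn, "a", new_id=flaky_id)
second = insert_with_retry(conn, "b", new_id=flaky_id)
row_count = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
```

Without the constraint, the second insert would have stored a second row under the same ID and nothing would have flagged it.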

This article was originally published in Chinese on lastdba.com.
