Benchmark Tests

This page explains the statistical tests used to compare rng.dev against established randomness beacons: drand and NIST.

Key insight: If rng.dev's output is statistically indistinguishable from these gold-standard beacons, you can trust it for the same use cases.


What We're Comparing

Each beacon produces random hash output per round. We compare the raw hash bytes (0-255) directly for maximum sensitivity — with 256 possible byte values, we can detect subtle biases that would be invisible when reduced to 6 die values.

| Beacon  | Output                     | Cadence              | Bytes per Hash |
|---------|----------------------------|----------------------|----------------|
| rng.dev | 256-bit SHA3-256 hash      | 1 second             | 32 bytes       |
| drand   | 256-bit BLS signature hash | 3 seconds (quicknet) | 32 bytes       |
| NIST    | 512-bit hash               | 60 seconds           | 64 bytes       |

For comparison, we:

  1. Collect N rounds from each beacon (e.g., 1,000 rounds)
  2. Extract all bytes from each round's hash (32 bytes × 1,000 = 32,000 bytes)
  3. Run Kolmogorov-Smirnov tests on the byte distributions
  4. Compare the resulting effect sizes

If all beacons show similar effect sizes near zero, their outputs are statistically equivalent.
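As a sketch, steps 2–4 might look like the following in Python, using SciPy's two-sample K-S test. The list-of-digests input format and the helper names (hash_bytes, pairwise_effect_size) are illustrative assumptions, not part of the rng.dev API:

import numpy as np
from scipy import stats

def hash_bytes(rounds: list[bytes]) -> np.ndarray:
    """Flatten N rounds of hash output into one array of byte values (0-255)."""
    return np.frombuffer(b"".join(rounds), dtype=np.uint8)

def pairwise_effect_size(rounds_a: list[bytes], rounds_b: list[bytes]) -> float:
    """Two-sample K-S test between two beacons' byte distributions."""
    result = stats.ks_2samp(hash_bytes(rounds_a), hash_bytes(rounds_b))
    return result.statistic  # the D-statistic doubles as the effect size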

Matched Sample Sizes

For fair comparison, we use matched sample sizes. If you request 100,000 rounds but NIST only has 5,000 samples available, we compare 5,000 rounds from each source. The API response explains when sample size is limited by a particular source.


The Benchmark Tests

The benchmark table displays 4 statistical tests on raw hash bytes:

| Test           | What It Measures                           | Effect Size                      |
|----------------|--------------------------------------------|----------------------------------|
| K-S Uniformity | Deviation from uniform byte distribution   | D-statistic (0-1)                |
| K-S Pairwise   | Distribution difference between sources    | Max D-statistic vs other sources |
| Byte Entropy   | Information content of byte distribution   | 1 - efficiency (0 = ideal)       |
| Serial Corr.   | Autocorrelation between consecutive bytes  | Max |r| across lags              |

1. K-S Pairwise Comparison (Row: "K-S Pairwise")

What it tests: Are the byte distributions from two beacons indistinguishable?

How it works:

  1. Collect hash bytes (0-255) from both beacons
  2. Run a two-sample Kolmogorov-Smirnov test
  3. Calculate the D-statistic (maximum difference between cumulative distributions)
  4. Use D-statistic as effect size
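
To make step 3 concrete, the D-statistic can also be computed by hand as the largest gap between the two empirical CDFs, evaluated at each of the 256 byte values. This is an illustrative sketch, equivalent to what scipy.stats.ks_2samp reports for discrete byte data:

import numpy as np

def d_statistic(bytes_a: np.ndarray, bytes_b: np.ndarray) -> float:
    """Maximum absolute gap between the two empirical CDFs."""
    values = np.arange(256)
    # ECDF(v) = fraction of samples <= v, for each byte value v
    cdf_a = np.searchsorted(np.sort(bytes_a), values, side="right") / len(bytes_a)
    cdf_b = np.searchsorted(np.sort(bytes_b), values, side="right") / len(bytes_b)
    return float(np.max(np.abs(cdf_a - cdf_b)))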

Pairwise comparisons performed:

  • rng.dev vs drand
  • rng.dev vs NIST
  • drand vs NIST

Why it matters:

If two random sources produce statistically equivalent byte distributions, their outputs are interchangeable for practical purposes. The K-S test is sensitive to differences in shape, location, and scale of distributions.

Interpretation:

The D-statistic is the effect size (0 to 1, lower = more similar):

| D-statistic | Effect          | Status        |
|-------------|-----------------|---------------|
| < 0.10      | Negligible/Weak | ✓ PASS        |
| 0.10 - 0.20 | Moderate        | ⚠ WATCH       |
| > 0.20      | Strong          | ✗ INVESTIGATE |

Example result:

Pairwise K-S Comparison (1,000 rounds):

Comparison      | D-statistic | Effect Size | Status
----------------|-------------|-------------|--------
rng.dev ↔ drand | 0.008       | negligible  | PASS
rng.dev ↔ NIST  | 0.011       | negligible  | PASS
drand ↔ NIST    | 0.007       | negligible  | PASS

Table display: Each cell shows that source's maximum effect size vs the other two sources. For example:

  • rng.dev cell: max(rng↔drand, rng↔NIST)
  • drand cell: max(drand↔rng, drand↔NIST)
  • NIST cell: max(NIST↔rng, NIST↔drand)
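
A small illustration of that aggregation, using the example D-statistics from the table above (the dictionary and function names are placeholders):

pairwise = {("rng.dev", "drand"): 0.008,
            ("rng.dev", "NIST"): 0.011,
            ("drand", "NIST"): 0.007}

def cell_value(source: str) -> float:
    """Maximum D-statistic across every comparison involving this source."""
    return max(d for pair, d in pairwise.items() if source in pair)

print(cell_value("rng.dev"))  # 0.011 = max(rng.dev vs drand, rng.dev vs NIST)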

2. K-S Uniformity Test (Row: "K-S Uniformity")

What it tests: Does each beacon's byte distribution follow a uniform distribution (0-255)?

How it works:

  1. Collect all hash bytes from N rounds
  2. Run a one-sample K-S test against the theoretical uniform(0, 255) distribution
  3. Calculate the D-statistic as effect size
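
A sketch of this test in Python. Note that SciPy's one-sample K-S test assumes a continuous distribution, so treating discrete bytes as draws from a continuous uniform over [0, 256) is a simplifying approximation:

import numpy as np
from scipy import stats

def uniformity_effect_size(byte_array: np.ndarray) -> float:
    """One-sample K-S test of observed bytes against a uniform distribution."""
    # Continuous uniform over [0, 256) approximates the discrete byte range
    result = stats.kstest(byte_array, stats.uniform(loc=0, scale=256).cdf)
    return result.statistic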

Why it matters:

A cryptographic hash should produce bytes uniformly distributed across 0-255. Any deviation suggests a flaw in the hash function or entropy source.

Perfect uniform byte distribution:
┌────────────────────────────────────────────────┐
│ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ │
│ Each byte value (0-255) appears equally often │
└────────────────────────────────────────────────┘

Biased distribution (would fail uniformity test):
┌────────────────────────────────────────────────┐
│ ▓▓▓▓▓▓████████████████████▓▓▓▓▓▓ │
│ Some byte values appear more often than others │
└────────────────────────────────────────────────┘

Interpretation:

| D-statistic | Effect          | Status        |
|-------------|-----------------|---------------|
| < 0.10      | Negligible/Weak | ✓ PASS        |
| 0.10 - 0.20 | Moderate        | ⚠ WATCH       |
| > 0.20      | Strong          | ✗ INVESTIGATE |

Example result:

Uniformity Test (1,000 rounds):

Beacon  | D-statistic | Effect Size | Status
--------|-------------|-------------|--------
rng.dev | 0.006       | negligible  | PASS
drand   | 0.005       | negligible  | PASS
NIST    | 0.008       | negligible  | PASS

3. Byte Entropy Test (Row: "Byte Entropy")

What it tests: Does the byte distribution contain maximum information content?

How it works:

  1. Collect all hash bytes from N rounds
  2. Calculate Shannon entropy: H = -Σ(p × log₂(p))
  3. Maximum entropy for bytes is log₂(256) = 8 bits
  4. Efficiency = actual entropy / 8 bits
  5. Effect size = 1 - efficiency (0 = perfect, 1 = completely biased)
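
A sketch of the calculation, mirroring the steps above (the helper name is illustrative):

import numpy as np

def entropy_effect_size(byte_array: np.ndarray) -> float:
    """Effect size = 1 - (Shannon entropy / 8 bits)."""
    counts = np.bincount(byte_array, minlength=256)
    p = counts[counts > 0] / len(byte_array)   # observed byte probabilities
    entropy = -np.sum(p * np.log2(p))          # H = -sum(p * log2(p))
    efficiency = entropy / 8.0                 # log2(256) = 8 bits is the maximum
    return 1.0 - efficiency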

Why it matters:

A cryptographic hash should produce bytes with maximum entropy (8 bits). Lower entropy indicates some byte values appear more frequently than others — a potential bias in the random source.

Perfect entropy (8 bits):
┌────────────────────────────────────────────────┐
│ All 256 byte values equally likely │
│ Entropy = 8.00 bits, Efficiency = 100% │
└────────────────────────────────────────────────┘

Reduced entropy (4 bits):
┌────────────────────────────────────────────────┐
│ Only 16 byte values appear (e.g., 0-15) │
│ Entropy = 4.00 bits, Efficiency = 50% │
│ Effect size = 0.50 → INVESTIGATE │
└────────────────────────────────────────────────┘

Interpretation:

| Effect Size | Efficiency | Status        |
|-------------|------------|---------------|
| < 0.10      | > 90%      | ✓ PASS        |
| 0.10 - 0.20 | 80-90%     | ⚠ WATCH       |
| > 0.20      | < 80%      | ✗ INVESTIGATE |

4. Serial Correlation Test (Row: "Serial Corr.")

What it tests: Are consecutive bytes correlated (predictable from each other)?

How it works:

  1. Calculate autocorrelation at lags 1-20
  2. Autocorrelation r(k) measures how much byte[i] correlates with byte[i+k]
  3. Effect size = maximum |r| across all lags
  4. Uses Ljung-Box test for combined significance
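
A sketch of the effect-size calculation (the combined Ljung-Box p-value can be obtained separately, for example from statsmodels' acorr_ljungbox):

import numpy as np

def serial_correlation_effect_size(byte_array: np.ndarray, max_lag: int = 20) -> float:
    """Maximum |r| of the lag-k autocorrelations for k = 1..max_lag."""
    x = byte_array.astype(float) - byte_array.mean()
    denom = np.sum(x * x)
    # r(k) = sum(x[i] * x[i+k]) / sum(x[i]^2), computed for each lag k
    r = [np.sum(x[:-k] * x[k:]) / denom for k in range(1, max_lag + 1)]
    return float(np.max(np.abs(r)))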

Why it matters:

In a random sequence, knowing one byte should provide no information about subsequent bytes. High autocorrelation suggests patterns or predictability that could compromise randomness.

Random data (no correlation):
┌────────────────────────────────────────────────┐
│ Lag 1: r = 0.002 Lag 2: r = -0.001 │
│ Lag 3: r = 0.003 Lag 4: r = 0.001 │
│ Max |r| = 0.003 → Effect size = 0.003 → PASS │
└────────────────────────────────────────────────┘

Periodic pattern detected:
┌────────────────────────────────────────────────┐
│ Lag 1: r = -0.15 Lag 2: r = 0.02 │
│ Lag 3: r = 0.01 Lag 4: r = 0.89 ← period │
│ Max |r| = 0.89 → Effect size = 0.89 → FAIL │
└────────────────────────────────────────────────┘

Interpretation:

Based on Cohen's (1988) conventions for correlation:

| Max |r|     | Effect     | Status        |
|-------------|------------|---------------|
| < 0.10      | Negligible | ✓ PASS        |
| 0.10 - 0.20 | Weak       | ⚠ WATCH       |
| > 0.20      | Moderate+  | ✗ INVESTIGATE |


How to Read the Results

The benchmark table shows effect sizes for each test. Here's how to interpret them:

| Effect Size | Color  | Meaning                           |
|-------------|--------|-----------------------------------|
| < 0.05      | Green  | Negligible — excellent randomness |
| 0.05 - 0.10 | Green  | Weak — good randomness            |
| 0.10 - 0.20 | Yellow | Moderate — worth monitoring       |
| > 0.20      | Red    | Strong — investigate for issues   |

Key points:

  1. Similar effect sizes across beacons = rng.dev is statistically equivalent
  2. Effect sizes near zero = distributions are indistinguishable from ideal
  3. All beacons should behave similarly — if one shows high effect size and others don't, investigate

Why effect size instead of p-values?

At large sample sizes (10,000+ rounds), p-values become misleading. A 0.1% deviation from perfect uniformity can produce p < 0.001, but such tiny deviations have no practical impact. Effect size measures the magnitude of deviation, which stays meaningful regardless of sample size.
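
A simulation sketch of this phenomenon: shift 0.1% of probability mass onto byte 0 and compare against an unbiased sample. Exact values vary by seed:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 10_000_000                      # roughly 312,500 rounds' worth of bytes

probs = np.full(256, 1 / 256)
probs[0] += 0.001                   # 0.1% extra probability mass on byte 0
probs /= probs.sum()

unbiased = rng.integers(0, 256, size=n)
biased = rng.choice(256, size=n, p=probs)

result = stats.ks_2samp(unbiased, biased)
print(f"D = {result.statistic:.4f}, p = {result.pvalue:.1e}")
# At this sample size, p typically falls below 0.001 while D stays near the
# 0.001 bias magnitude -- statistically "significant", practically negligible.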


Why Compare Against drand and NIST?

| Beacon | Why It's a Gold Standard                                                                   |
|--------|--------------------------------------------------------------------------------------------|
| drand  | Threshold BLS signatures from 20+ independent operators; output is verifiable and unbiasable |
| NIST   | US government standard; hardware-sourced entropy; decades of cryptographic research          |

If rng.dev's statistical properties match these established beacons, you can trust it for equivalent use cases. The comparison provides empirical evidence that our blockchain-derived randomness is as good as purpose-built randomness beacons.


Why Raw Bytes Instead of Die Values?

Previous versions compared die values (1-6) derived from each beacon's hash. We now compare raw hash bytes (0-255) for several reasons:

| Approach           | Categories | Sensitivity | Why                                                          |
|--------------------|------------|-------------|--------------------------------------------------------------|
| Die values (1-6)   | 6          | Lower       | Collapsing 256 byte values to 6 categories loses information |
| Hash bytes (0-255) | 256        | Higher      | Full resolution detects subtle biases                        |

Example:

If a beacon produces bytes 0-127 slightly more often than 128-255, die value derivation would hide this (all map to valid die values). Raw byte comparison catches it immediately.

Most users consume the hash directly, not the die value, so comparing hashes directly is more relevant to real-world usage.


Sample Size Considerations

The benchmark table lets you select different sample sizes:

| Sample Size    | Statistical Power               | Best For                   |
|----------------|---------------------------------|----------------------------|
| 100 rounds     | Low — can miss subtle bias      | Quick sanity check         |
| 1,000 rounds   | Medium — catches most issues    | Standard monitoring        |
| 10,000 rounds  | High — detects subtle patterns  | Deep analysis              |
| 100,000 rounds | Very high — rigorous validation | Publication-quality claims |

Larger samples provide more statistical power but take longer to collect. The default (1,000 rounds) balances responsiveness with reliability.


API Response

The /api/v1/comparisons endpoint returns:

{
  "sample_counts": {
    "beacon": 1000,
    "drand": 1000,
    "nist": 1000
  },
  "byte_counts": {
    "beacon": 32000,
    "drand": 32000,
    "nist": 64000
  },
  "pairwise_comparisons": [
    {
      "source1": "beacon",
      "source2": "drand",
      "statistic": 0.008,
      "p_value": 0.92,
      "effect_size": 0.008,
      "n1": 32000,
      "n2": 32000,
      "interpretation": "Byte distributions are virtually identical",
      "status": "PASS"
    }
  ],
  "uniformity_tests": [
    {
      "source": "beacon",
      "statistic": 0.006,
      "p_value": 0.95,
      "effect_size": 0.006,
      "n": 32000,
      "interpretation": "Byte distribution virtually uniform",
      "status": "PASS"
    }
  ],
  "entropy_tests": [
    {
      "source": "beacon",
      "entropy": 7.999,
      "max_entropy": 8.0,
      "efficiency": 0.9999,
      "effect_size": 0.0001,
      "n": 32000,
      "status": "PASS"
    }
  ],
  "serial_correlation_tests": [
    {
      "source": "beacon",
      "max_correlation": 0.012,
      "effect_size": 0.012,
      "p_value": 0.89,
      "status": "PASS"
    }
  ],
  "overall_status": "PASS",
  "plain_english": "Compared 1,000 samples from each source. All distributions are statistically indistinguishable — beacon randomness matches known reference sources."
}
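
A usage sketch with Python's standard library. The host below is a placeholder, and any query parameters (for example, to select the sample size) are not shown here:

import json
import urllib.request

URL = "https://rng.dev/api/v1/comparisons"   # placeholder host + documented path

with urllib.request.urlopen(URL) as resp:
    report = json.load(resp)

print(report["overall_status"])              # e.g. "PASS"
print(report["plain_english"])
for test in report["pairwise_comparisons"]:
    print(test["source1"], "vs", test["source2"],
          "D =", test["effect_size"], test["status"])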

Relationship to Die Value Tests

| Test Type       | Purpose                                                    | Documented In                       |
|-----------------|------------------------------------------------------------|-------------------------------------|
| Benchmark tests | Compare rng.dev against drand & NIST using raw hash bytes  | This page (cross-beacon validation) |
| Die value tests | Validate rng.dev's die value output quality                | Statistical Tests                   |

Both test types analyze beacon output quality. The difference is:

  • Benchmark tests: Compare raw hash bytes against other trusted beacons (maximum sensitivity)
  • Die value tests: Analyze die values (1-6) derived from our hashes (relevant for dice-rolling use cases)

Both should pass for a well-functioning beacon.


Further Reading