
Benchmark Tests

This page explains the statistical tests used to compare rng.dev against established randomness beacons: drand and NIST.

Key insight: If rng.dev's output is statistically indistinguishable from these gold-standard beacons, you can trust it for the same use cases.


What We're Comparing

Each beacon produces a cryptographic hash per round (256 bits / 32 bytes for rng.dev and drand; 512 bits for NIST). We compare the statistical properties of these hash bytes across beacons.

Beacon | Output | Cadence
--------|----------------------------|--------
rng.dev | 256-bit SHA3-256 hash | 1 second
drand | 256-bit BLS signature hash | 3 seconds (quicknet)
NIST | 512-bit hash | 60 seconds

For comparison, we:

  1. Collect N rounds from each beacon (e.g., 1,000 rounds)
  2. Extract the hash bytes from each round
  3. Run identical statistical tests on each beacon's output
  4. Compare the resulting p-values

If all beacons show similar p-values, their outputs are statistically indistinguishable under these tests.
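The four-step workflow can be sketched as follows. The fetch step is omitted; `rounds_hex` is a synthetic stand-in for whatever hex-encoded round values your beacon client returns:

```python
import hashlib

# Hypothetical stand-in for step 1: 1,000 hex-encoded rounds from one beacon.
rounds_hex = [hashlib.sha3_256(i.to_bytes(8, "big")).hexdigest()
              for i in range(1000)]

# Step 2: extract the raw 32 hash bytes from each round.
hashes = [bytes.fromhex(h) for h in rounds_hex]
all_bytes = [b for h in hashes for b in h]  # 32,000 byte values in 0-255
```

Steps 3 and 4 then run the same tests on each beacon's `all_bytes` and compare the resulting p-values.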


The Four Benchmark Tests

1. Kolmogorov-Smirnov (K-S) Test

What it tests: Do the hash bytes follow a uniform distribution?

How it works:

  1. Extract all bytes from N rounds (N × 32 bytes for 256-bit hashes)
  2. Build the empirical cumulative distribution function (ECDF)
  3. Compare ECDF to the theoretical uniform distribution (0-255)
  4. Calculate the maximum deviation (K-S statistic)

Why it matters:

A good random hash should produce bytes uniformly distributed across 0-255. Any deviation suggests bias in the underlying generation process.

Perfect uniform distribution:
┌────────────────────────────────────┐
│ ████████████████████████████████ │
│ ████████████████████████████████ │
│ ████████████████████████████████ │
└────────────────────────────────────┘
0 255

Biased distribution (would fail K-S):
┌────────────────────────────────────┐
│ ██████████████████████ │
│ ████████████████ │
│ ████████████ │
└────────────────────────────────────┘
0 128 255

Interpretation:

  • p > 0.05: Byte distribution is consistent with uniform random
  • p < 0.01: Statistically significant deviation from uniform

Example result:

K-S Test Results (1,000 rounds):

Beacon | D-statistic | p-value | Status
----------|-------------|---------|--------
rng.dev | 0.0089 | 0.89 | PASS
drand | 0.0092 | 0.87 | PASS
NIST | 0.0098 | 0.84 | PASS
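A minimal sketch of the K-S comparison using SciPy (not the project's exact code; the bytes here are synthetic). Because byte values are discrete, a small uniform jitter is added so the continuous K-S distribution applies cleanly:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
all_bytes = rng.integers(0, 256, size=32_000)  # stand-in for 1,000 rounds x 32 bytes

# De-discretize: add U(0,1) jitter so ties don't bias the K-S statistic.
jittered = all_bytes + rng.random(32_000)

# Compare the ECDF against the continuous uniform distribution on [0, 256).
d_stat, p_value = stats.kstest(jittered, "uniform", args=(0, 256))
```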

2. Chi-Squared Test

What it tests: Do all byte values (0-255) appear with equal frequency?

How it works:

  1. Count occurrences of each byte value (0-255) across all rounds
  2. Compare observed counts to expected counts (total_bytes / 256)
  3. Calculate chi-squared statistic: χ² = Σ (observed - expected)² / expected
  4. Convert to p-value using chi-squared distribution (255 degrees of freedom)

Why it matters:

The K-S test checks the overall shape of the distribution. Chi-squared checks whether specific byte values are over- or under-represented.

For 1,000 rounds × 32 bytes = 32,000 bytes:
- Expected per value: 32,000 / 256 = 125 occurrences
- Each value should appear ~125 times (standard deviation ≈ 11, so roughly ±22 at 95% confidence)

Interpretation:

  • p > 0.05: All byte values appear with expected frequency
  • p < 0.01: Some bytes appear too often or too rarely

Example result:

Chi-Squared Test (1,000 rounds):

Beacon | χ² statistic | p-value | Status
----------|--------------|---------|--------
rng.dev | 248.3 | 0.92 | PASS
drand | 251.7 | 0.94 | PASS
NIST | 246.1 | 0.91 | PASS
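The counting procedure above maps directly onto SciPy; a sketch on synthetic uniform bytes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)
all_bytes = rng.integers(0, 256, size=32_000)  # stand-in for 1,000 rounds x 32 bytes

observed = np.bincount(all_bytes, minlength=256)  # count of each byte value 0-255
expected = np.full(256, len(all_bytes) / 256)     # 125 per value for 32,000 bytes
chi2_stat, p_value = stats.chisquare(observed, expected)
```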

3. Runs Test

What it tests: Do sequences of increasing/decreasing bytes occur at the expected rate?

How it works:

  1. Compare each byte to the next: is it larger (+) or smaller (-)?
  2. Count "runs" — consecutive sequences of same direction
  3. Compare run count to expected value for random sequences
  4. Calculate p-value using normal approximation

Why it matters:

Even if individual bytes are uniformly distributed, they might follow patterns. The runs test detects:

  • Too much alternation (up-down-up-down)
  • Too much momentum (up-up-up-up)
  • Hidden sequential structure

Example byte sequence: [42, 88, 91, 67, 45, 78, 234, 12]
Directions: + + - - + + -
Runs: |──1──|──2──|──3──|───4───|

Expected runs for n bytes: (2n - 1) / 3
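The worked example above can be checked directly; counting sign changes in the direction sequence reproduces the four runs:

```python
import numpy as np

seq = np.array([42, 88, 91, 67, 45, 78, 234, 12])
signs = np.sign(np.diff(seq))                    # [+1, +1, -1, -1, +1, +1, -1]
runs = 1 + int(np.sum(signs[:-1] != signs[1:]))  # a new run starts at each sign change
expected = (2 * len(seq) - 1) / 3                # (2*8 - 1) / 3 = 5 expected runs
```

Observing 4 runs against an expectation of 5 is well within normal variation for so short a sequence.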

Interpretation:

  • p > 0.05: Run count is normal for random data
  • p < 0.01: Sequence has abnormal patterns

Example result:

Runs Test (1,000 rounds):

Beacon | Observed Runs | Expected | p-value | Status
----------|---------------|----------|---------|--------
rng.dev | 21,287 | 21,333 | 0.71 | PASS
drand | 21,198 | 21,333 | 0.68 | PASS
NIST | 21,156 | 21,333 | 0.67 | PASS

4. Serial Correlation

What it tests: Is there correlation between consecutive bytes?

How it works:

  1. For each byte pair (b[i], b[i+1]), compute their correlation
  2. Calculate Pearson correlation coefficient across all pairs
  3. Test whether correlation is significantly different from zero
  4. Convert to p-value using t-distribution

Why it matters:

Perfect random data has zero correlation between consecutive values. Serial correlation detects:

  • Linear predictability (knowing byte N helps predict byte N+1)
  • Lagged dependencies
  • Poor mixing in the hash function

Zero correlation (ideal):
byte[i+1]
│ · · · · ·
│ · · · · · ·
│ · · · · · ·
└─────────────────→ byte[i]
(random scatter, no pattern)

Positive correlation (bad):
byte[i+1]
│ · · ·
│ · · ·
│ · · ·
└─────────────────→ byte[i]
(larger bytes followed by larger bytes)

Interpretation:

  • p > 0.05: No significant correlation (good)
  • p < 0.01: Consecutive bytes are correlated (bad)

Example result:

Serial Correlation (1,000 rounds):

Beacon | Correlation | p-value | Status
----------|-------------|---------|--------
rng.dev | -0.0021 | 0.82 | PASS
drand | 0.0034 | 0.77 | PASS
NIST | -0.0028 | 0.79 | PASS
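As a sketch (on synthetic data, not real beacon output), the same check via `scipy.stats.pearsonr`, contrasted with a deliberately autocorrelated sequence that fails it:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)
all_bytes = rng.integers(0, 256, size=32_000)

# Pearson correlation between each byte and its successor.
corr, p_value = stats.pearsonr(all_bytes[:-1], all_bytes[1:])

# An autocorrelated sequence (a slow random walk mod 256) fails the same check.
trend = np.cumsum(rng.integers(0, 4, size=32_000)) % 256
bad_corr, bad_p = stats.pearsonr(trend[:-1], trend[1:])
```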

How to Read the Results

The benchmark table shows p-values for each test. Here's how to interpret them:

P-Value Range | Color | Meaning
--------------|--------|--------
> 0.10 | Green | Strong pass — well within expected range
0.05 - 0.10 | Green | Pass — acceptable
0.01 - 0.05 | Yellow | Borderline — worth monitoring
< 0.01 | Red | Investigate — statistically unusual

Key points:

  1. Similar p-values across beacons = rng.dev is statistically equivalent
  2. Occasional low p-values are normal — about 5% of tests fall below 0.05 by chance alone
  3. All beacons should behave similarly — if one fails and others pass, investigate

Why Compare Against drand and NIST?

Beacon | Why It's a Gold Standard
--------|-------------------------
drand | Threshold BLS signatures from 20+ independent operators; mathematically provable randomness
NIST | US government standard; hardware-sourced entropy; decades of cryptographic research

If rng.dev's statistical properties match these established beacons, you can trust it for equivalent use cases. The comparison provides empirical evidence that our blockchain-derived randomness is as good as purpose-built randomness beacons.


Sample Size Considerations

The benchmark table lets you select different sample sizes:

Sample Size | Statistical Power | Best For
-------------|-------------------|----------
100 rounds | Low — can miss subtle bias | Quick sanity check
1,000 rounds | Medium — catches most issues | Standard monitoring
10,000 rounds | High — detects subtle patterns | Deep analysis
100,000 rounds | Very high — rigorous validation | Publication-quality claims

Larger samples provide more statistical power but take longer to collect. The default (1,000 rounds) balances responsiveness with reliability.
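To illustrate the power tradeoff, here is a hypothetical experiment (synthetic data, not real beacon output): inject a small frequency bias and run the chi-squared test at two sample sizes. The bias that large samples flag decisively can slip past small ones:

```python
import numpy as np
from scipy import stats

def chi2_p(byte_stream: np.ndarray) -> float:
    """Chi-squared p-value for byte-frequency uniformity."""
    observed = np.bincount(byte_stream, minlength=256)
    expected = np.full(256, len(byte_stream) / 256)
    return stats.chisquare(observed, expected).pvalue

rng = np.random.default_rng(seed=1)

def biased_bytes(n: int) -> np.ndarray:
    """Uniform bytes, except value 0 appears ~0.3% too often."""
    raw = rng.integers(0, 256, size=n)
    raw[rng.random(n) < 0.003] = 0
    return raw

p_small = chi2_p(biased_bytes(3_200))    # ~100 rounds: bias often goes unnoticed
p_large = chi2_p(biased_bytes(320_000))  # ~10,000 rounds: bias stands out clearly
```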


Technical Implementation

For those implementing their own comparison:

import numpy as np
from scipy import stats

def compare_beacons(rng_hashes: list[bytes],
                    drand_hashes: list[bytes],
                    nist_hashes: list[bytes]) -> dict:
    """
    Compare three beacons using standard statistical tests.
    Each hash is a bytes value (32 bytes for rng.dev and drand,
    64 bytes for NIST).
    """
    results = {}

    for name, hashes in [('rng', rng_hashes),
                         ('drand', drand_hashes),
                         ('nist', nist_hashes)]:
        # Flatten all rounds into one array of byte values (0-255)
        all_bytes = np.array([b for h in hashes for b in h])

        # 1. K-S Test: compare to the uniform distribution on [0, 256).
        #    Byte values are discrete, so this p-value is approximate.
        ks_stat, ks_p = stats.kstest(all_bytes, 'uniform', args=(0, 256))

        # 2. Chi-Squared: byte value frequencies vs. equal expected counts
        observed = np.bincount(all_bytes, minlength=256)
        expected = len(all_bytes) / 256
        chi2_stat, chi2_p = stats.chisquare(observed, np.full(256, expected))

        # 3. Runs Test: sequential up/down patterns
        runs_p = runs_test(all_bytes)

        # 4. Serial Correlation: dependence between consecutive bytes
        corr, corr_p = stats.pearsonr(all_bytes[:-1], all_bytes[1:])

        results[name] = {
            'ks': {'stat': ks_stat, 'p': ks_p},
            'chi2': {'stat': chi2_stat, 'p': chi2_p},
            'runs': {'p': runs_p},
            'serial': {'corr': corr, 'p': corr_p},
        }

    return results

def runs_test(data: np.ndarray) -> float:
    """Runs test (runs up and down) for randomness."""
    # Direction of each step: +1 (up) or -1 (down); drop ties (0)
    diffs = np.diff(data)
    signs = np.sign(diffs)
    signs = signs[signs != 0]

    # A new run starts at every sign change
    runs = 1 + np.sum(signs[:-1] != signs[1:])
    n = len(signs) + 1  # effective sequence length after removing ties

    # Expected runs and variance for a random sequence of length n
    expected = (2 * n - 1) / 3
    variance = (16 * n - 29) / 90

    # Two-sided p-value from the normal approximation
    z = (runs - expected) / np.sqrt(variance)
    p = 2 * stats.norm.sf(abs(z))

    return p

Relationship to Die Value Tests

Test Type | Target | Documented In
-----------|--------|---------------
Benchmark tests (this page) | 256-bit hash bytes | Comparing beacons
Die value tests | 1-6 derived values | Statistical Tests

The benchmark tests validate the underlying hash quality. The die value tests validate the derived output used for visualization. Both should pass for a well-functioning beacon.


Further Reading