Benchmark Tests

This page explains the statistical tests used to compare rng.dev against established randomness beacons: drand and NIST.

Key insight: If rng.dev's output is statistically indistinguishable from these gold-standard beacons, you can trust it for the same use cases.


What We're Comparing

Each beacon produces random hash output per round. We compare the raw hash bytes (0-255) directly for maximum sensitivity — with 256 possible byte values, we can detect subtle biases that would be invisible when reduced to 6 die values.

| Beacon  | Output                     | Cadence              | Bytes per Hash |
|---------|----------------------------|----------------------|----------------|
| rng.dev | 256-bit SHA3-256 hash      | 1 second             | 32 bytes       |
| drand   | 256-bit BLS signature hash | 3 seconds (quicknet) | 32 bytes       |
| NIST    | 512-bit hash               | 60 seconds           | 64 bytes       |

For comparison, we:

  1. Collect N rounds from each beacon (e.g., 1,000 rounds)
  2. Extract all bytes from each round's hash (32 bytes × 1,000 = 32,000 bytes)
  3. Run Kolmogorov-Smirnov tests on the byte distributions
  4. Compare the resulting effect sizes

If all beacons show similar effect sizes near zero, their outputs are statistically equivalent.
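As a sketch, steps 2–4 might look like the following in Python, using SciPy's two-sample K-S test. The list-of-digests input format and the helper names (hash_bytes, pairwise_effect_size) are illustrative assumptions, not part of the rng.dev API:

import numpy as np
from scipy import stats

def hash_bytes(rounds: list[bytes]) -> np.ndarray:
    """Flatten N rounds of hash output into one array of byte values (0-255)."""
    return np.frombuffer(b"".join(rounds), dtype=np.uint8)

def pairwise_effect_size(rounds_a: list[bytes], rounds_b: list[bytes]) -> float:
    """Two-sample K-S test between two beacons' byte distributions."""
    result = stats.ks_2samp(hash_bytes(rounds_a), hash_bytes(rounds_b))
    return result.statistic  # the D-statistic doubles as the effect size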

Matched Sample Sizes

For fair comparison, we use matched sample sizes. If you request 100,000 rounds but NIST only has 5,000 samples available, we compare 5,000 rounds from each source. The API response explains when sample size is limited by a particular source.


The Benchmark Tests

The benchmark table displays 4 statistical tests on raw hash bytes:

| Test           | What It Measures                           | Effect Size                      |
|----------------|--------------------------------------------|----------------------------------|
| K-S Uniformity | Deviation from uniform byte distribution   | D-statistic (0-1)                |
| K-S Pairwise   | Distribution difference between sources    | Max D-statistic vs other sources |
| Byte Entropy   | Information content of byte distribution   | 1 - efficiency (0 = ideal)       |
| Serial Corr.   | Autocorrelation between consecutive bytes  | Max |r| across lags              |

1. K-S Pairwise Comparison (Row: "K-S Pairwise")

What it tests: Are the byte distributions from two beacons indistinguishable?

How it works:

  1. Collect hash bytes (0-255) from both beacons
  2. Run a two-sample Kolmogorov-Smirnov test
  3. Calculate the D-statistic (maximum difference between cumulative distributions)
  4. Use D-statistic as effect size
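
To make step 3 concrete, the D-statistic can also be computed by hand as the largest gap between the two empirical CDFs, evaluated at each of the 256 byte values. This is an illustrative sketch, equivalent to what scipy.stats.ks_2samp reports for discrete byte data:

import numpy as np

def d_statistic(bytes_a: np.ndarray, bytes_b: np.ndarray) -> float:
    """Maximum absolute gap between the two empirical CDFs."""
    values = np.arange(256)
    # ECDF(v) = fraction of samples <= v, for each byte value v
    cdf_a = np.searchsorted(np.sort(bytes_a), values, side="right") / len(bytes_a)
    cdf_b = np.searchsorted(np.sort(bytes_b), values, side="right") / len(bytes_b)
    return float(np.max(np.abs(cdf_a - cdf_b)))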

Pairwise comparisons performed:

  • rng.dev vs drand
  • rng.dev vs NIST
  • drand vs NIST

Why it matters:

If two random sources produce statistically equivalent byte distributions, their outputs are interchangeable for practical purposes. The K-S test is sensitive to differences in shape, location, and scale of distributions.

Interpretation:

The D-statistic is the effect size (0 to 1, lower = more similar):

| D-statistic | Effect          | Status        |
|-------------|-----------------|---------------|
| < 0.10      | Negligible/Weak | ✓ PASS        |
| 0.10 - 0.20 | Moderate        | ⚠ WATCH       |
| > 0.20      | Strong          | ✗ INVESTIGATE |

Example result:

Pairwise K-S Comparison (1,000 rounds):

Comparison      | D-statistic | Effect Size | Status
----------------|-------------|-------------|--------
rng.dev ↔ drand | 0.008       | negligible  | PASS
rng.dev ↔ NIST  | 0.011       | negligible  | PASS
drand ↔ NIST    | 0.007       | negligible  | PASS

Table display: Each cell shows that source's maximum effect size vs the other two sources. For example:

  • rng.dev cell: max(rng↔drand, rng↔NIST)
  • drand cell: max(drand↔rng, drand↔NIST)
  • NIST cell: max(NIST↔rng, NIST↔drand)
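
A small illustration of that aggregation, using the example D-statistics from the table above (the dictionary and function names are placeholders):

pairwise = {("rng.dev", "drand"): 0.008,
            ("rng.dev", "NIST"): 0.011,
            ("drand", "NIST"): 0.007}

def cell_value(source: str) -> float:
    """Maximum D-statistic across every comparison involving this source."""
    return max(d for pair, d in pairwise.items() if source in pair)

print(cell_value("rng.dev"))  # 0.011 = max(rng.dev vs drand, rng.dev vs NIST)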

2. K-S Uniformity Test (Row: "K-S Uniformity")

What it tests: Does each beacon's byte distribution follow a uniform distribution (0-255)?

How it works:

  1. Collect all hash bytes from N rounds
  2. Run a one-sample K-S test against the theoretical uniform(0, 255) distribution
  3. Calculate the D-statistic as effect size
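
A sketch of this test in Python. Note that SciPy's one-sample K-S test assumes a continuous distribution, so treating discrete bytes as draws from a continuous uniform over [0, 256) is a simplifying approximation:

import numpy as np
from scipy import stats

def uniformity_effect_size(byte_array: np.ndarray) -> float:
    """One-sample K-S test of observed bytes against a uniform distribution."""
    # Continuous uniform over [0, 256) approximates the discrete byte range
    result = stats.kstest(byte_array, stats.uniform(loc=0, scale=256).cdf)
    return result.statistic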

Why it matters:

A cryptographic hash should produce bytes uniformly distributed across 0-255. Any deviation suggests a flaw in the hash function or entropy source.

Perfect uniform byte distribution:
┌────────────────────────────────────────────────┐
│ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ │
│ Each byte value (0-255) appears equally often │
└────────────────────────────────────────────────┘

Biased distribution (would fail uniformity test):
┌────────────────────────────────────────────────┐
│ ▓▓▓▓▓▓████████████████████▓▓▓▓▓▓ │
│ Some byte values appear more often than others │
└────────────────────────────────────────────────┘

Interpretation:

| D-statistic | Effect          | Status        |
|-------------|-----------------|---------------|
| < 0.10      | Negligible/Weak | ✓ PASS        |
| 0.10 - 0.20 | Moderate        | ⚠ WATCH       |
| > 0.20      | Strong          | ✗ INVESTIGATE |

Example result:

Uniformity Test (1,000 rounds):

Beacon  | D-statistic | Effect Size | Status
--------|-------------|-------------|--------
rng.dev | 0.006       | negligible  | PASS
drand   | 0.005       | negligible  | PASS
NIST    | 0.008       | negligible  | PASS

3. Byte Entropy Test (Row: "Byte Entropy")

What it tests: Does the byte distribution contain maximum information content?

How it works:

  1. Collect all hash bytes from N rounds
  2. Calculate Shannon entropy: H = -Σ(p × log₂(p))
  3. Maximum entropy for bytes is log₂(256) = 8 bits
  4. Efficiency = actual entropy / 8 bits
  5. Effect size = 1 - efficiency (0 = perfect, 1 = completely biased)
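
A sketch of the calculation, mirroring the steps above (the helper name is illustrative):

import numpy as np

def entropy_effect_size(byte_array: np.ndarray) -> float:
    """Effect size = 1 - (Shannon entropy / 8 bits)."""
    counts = np.bincount(byte_array, minlength=256)
    p = counts[counts > 0] / len(byte_array)   # observed byte probabilities
    entropy = -np.sum(p * np.log2(p))          # H = -sum(p * log2(p))
    efficiency = entropy / 8.0                 # log2(256) = 8 bits is the maximum
    return 1.0 - efficiency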

Why it matters:

A cryptographic hash should produce bytes with maximum entropy (8 bits). Lower entropy indicates some byte values appear more frequently than others — a potential bias in the random source.

Perfect entropy (8 bits):
┌────────────────────────────────────────────────┐
│ All 256 byte values equally likely │
│ Entropy = 8.00 bits, Efficiency = 100% │
└────────────────────────────────────────────────┘

Reduced entropy (4 bits):
┌────────────────────────────────────────────────┐
│ Only 16 byte values appear (e.g., 0-15) │
│ Entropy = 4.00 bits, Efficiency = 50% │
│ Effect size = 0.50 → INVESTIGATE │
└────────────────────────────────────────────────┘

Interpretation:

| Effect Size | Efficiency | Status        |
|-------------|------------|---------------|
| < 0.10      | > 90%      | ✓ PASS        |
| 0.10 - 0.20 | 80-90%     | ⚠ WATCH       |
| > 0.20      | < 80%      | ✗ INVESTIGATE |

4. Serial Correlation Test (Row: "Serial Corr.")

What it tests: Are consecutive bytes correlated (predictable from each other)?

How it works:

  1. Calculate autocorrelation at lags 1-20
  2. Autocorrelation r(k) measures how much byte[i] correlates with byte[i+k]
  3. Effect size = maximum |r| across all lags
  4. Uses Ljung-Box test for combined significance
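
A sketch of the effect-size calculation (the combined Ljung-Box p-value can be obtained separately, for example from statsmodels' acorr_ljungbox):

import numpy as np

def serial_correlation_effect_size(byte_array: np.ndarray, max_lag: int = 20) -> float:
    """Maximum |r| of the lag-k autocorrelations for k = 1..max_lag."""
    x = byte_array.astype(float) - byte_array.mean()
    denom = np.sum(x * x)
    # r(k) = sum(x[i] * x[i+k]) / sum(x[i]^2), computed for each lag k
    r = [np.sum(x[:-k] * x[k:]) / denom for k in range(1, max_lag + 1)]
    return float(np.max(np.abs(r)))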

Why it matters:

In a random sequence, knowing one byte should provide no information about subsequent bytes. High autocorrelation suggests patterns or predictability that could compromise randomness.

Random data (no correlation):
┌────────────────────────────────────────────────┐
│ Lag 1: r = 0.002 Lag 2: r = -0.001 │
│ Lag 3: r = 0.003 Lag 4: r = 0.001 │
│ Max |r| = 0.003 → Effect size = 0.003 → PASS │
└────────────────────────────────────────────────┘

Periodic pattern detected:
┌────────────────────────────────────────────────┐
│ Lag 1: r = -0.15 Lag 2: r = 0.02 │
│ Lag 3: r = 0.01 Lag 4: r = 0.89 ← period │
│ Max |r| = 0.89 → Effect size = 0.89 → FAIL │
└────────────────────────────────────────────────┘

Interpretation:

Based on Cohen's (1988) conventions for correlation:

| Max |r|     | Effect     | Status        |
|-------------|------------|---------------|
| < 0.10      | Negligible | ✓ PASS        |
| 0.10 - 0.20 | Weak       | ⚠ WATCH       |
| > 0.20      | Moderate+  | ✗ INVESTIGATE |


How to Read the Results

The benchmark table shows effect sizes for each test. Here's how to interpret them:

| Effect Size | Color  | Meaning                           |
|-------------|--------|-----------------------------------|
| < 0.05      | Green  | Negligible — excellent randomness |
| 0.05 - 0.10 | Green  | Weak — good randomness            |
| 0.10 - 0.20 | Yellow | Moderate — worth monitoring       |
| > 0.20      | Red    | Strong — investigate for issues   |

Key points:

  1. Similar effect sizes across beacons = rng.dev is statistically equivalent
  2. Effect sizes near zero = distributions are indistinguishable from ideal
  3. All beacons should behave similarly — if one shows high effect size and others don't, investigate

Why effect size instead of p-values?

At large sample sizes (10,000+ rounds), p-values become misleading. A 0.1% deviation from perfect uniformity can produce p < 0.001, but such tiny deviations have no practical impact. Effect size measures the magnitude of deviation, which stays meaningful regardless of sample size.
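
A simulation sketch of this phenomenon: shift 0.1% of probability mass onto byte 0 and compare against an unbiased sample. Exact values vary by seed:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 10_000_000                      # roughly 312,500 rounds' worth of bytes

probs = np.full(256, 1 / 256)
probs[0] += 0.001                   # 0.1% extra probability mass on byte 0
probs /= probs.sum()

unbiased = rng.integers(0, 256, size=n)
biased = rng.choice(256, size=n, p=probs)

result = stats.ks_2samp(unbiased, biased)
print(f"D = {result.statistic:.4f}, p = {result.pvalue:.1e}")
# At this sample size, p typically falls below 0.001 while D stays near the
# 0.001 bias magnitude -- statistically "significant", practically negligible.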


Why Compare Against drand and NIST?

| Beacon | Why It's a Gold Standard                                                                   |
|--------|--------------------------------------------------------------------------------------------|
| drand  | Threshold BLS signatures from 20+ independent operators; output is verifiable and unbiasable |
| NIST   | US government standard; hardware-sourced entropy; decades of cryptographic research          |

If rng.dev's statistical properties match these established beacons, you can trust it for equivalent use cases. The comparison provides empirical evidence that our blockchain-derived randomness is as good as purpose-built randomness beacons.


Why Raw Bytes Instead of Die Values?

Previous versions compared die values (1-6) derived from each beacon's hash. We now compare raw hash bytes (0-255) for several reasons:

| Approach           | Categories | Sensitivity | Why                                                          |
|--------------------|------------|-------------|--------------------------------------------------------------|
| Die values (1-6)   | 6          | Lower       | Collapsing 256 byte values to 6 categories loses information |
| Hash bytes (0-255) | 256        | Higher      | Full resolution detects subtle biases                        |

Example:

If a beacon produces bytes 0-127 slightly more often than 128-255, die value derivation would hide this (all map to valid die values). Raw byte comparison catches it immediately.

Most users consume the hash directly, not the die value, so comparing hashes directly is more relevant to real-world usage.


Sample Size Considerations

The benchmark table lets you select different sample sizes:

| Sample Size    | Statistical Power               | Best For                   |
|----------------|---------------------------------|----------------------------|
| 100 rounds     | Low — can miss subtle bias      | Quick sanity check         |
| 1,000 rounds   | Medium — catches most issues    | Standard monitoring        |
| 10,000 rounds  | High — detects subtle patterns  | Deep analysis              |
| 100,000 rounds | Very high — rigorous validation | Publication-quality claims |

Larger samples provide more statistical power but take longer to collect. The default (1,000 rounds) balances responsiveness with reliability.


API Response

The /api/v1/comparisons endpoint returns:

{
  "sample_counts": {
    "beacon": 1000,
    "drand": 1000,
    "nist": 1000
  },
  "byte_counts": {
    "beacon": 32000,
    "drand": 32000,
    "nist": 64000
  },
  "pairwise_comparisons": [
    {
      "source1": "beacon",
      "source2": "drand",
      "statistic": 0.008,
      "p_value": 0.92,
      "effect_size": 0.008,
      "n1": 32000,
      "n2": 32000,
      "interpretation": "Byte distributions are virtually identical",
      "status": "PASS"
    }
  ],
  "uniformity_tests": [
    {
      "source": "beacon",
      "statistic": 0.006,
      "p_value": 0.95,
      "effect_size": 0.006,
      "n": 32000,
      "interpretation": "Byte distribution virtually uniform",
      "status": "PASS"
    }
  ],
  "entropy_tests": [
    {
      "source": "beacon",
      "entropy": 7.999,
      "max_entropy": 8.0,
      "efficiency": 0.9999,
      "effect_size": 0.0001,
      "n": 32000,
      "status": "PASS"
    }
  ],
  "serial_correlation_tests": [
    {
      "source": "beacon",
      "max_correlation": 0.012,
      "effect_size": 0.012,
      "p_value": 0.89,
      "status": "PASS"
    }
  ],
  "overall_status": "PASS",
  "plain_english": "Compared 1,000 samples from each source. All distributions are statistically indistinguishable — beacon randomness matches known reference sources."
}
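
A usage sketch with Python's standard library. The host below is a placeholder, and any query parameters (for example, to select the sample size) are not shown here:

import json
import urllib.request

URL = "https://rng.dev/api/v1/comparisons"   # placeholder host + documented path

with urllib.request.urlopen(URL) as resp:
    report = json.load(resp)

print(report["overall_status"])              # e.g. "PASS"
print(report["plain_english"])
for test in report["pairwise_comparisons"]:
    print(test["source1"], "vs", test["source2"],
          "D =", test["effect_size"], test["status"])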

Relationship to Die Value Tests

| Test Type       | Purpose                                                    | Documented In                       |
|-----------------|------------------------------------------------------------|-------------------------------------|
| Benchmark tests | Compare rng.dev against drand & NIST using raw hash bytes  | This page (cross-beacon validation) |
| Die value tests | Validate rng.dev's die value output quality                | Statistical Tests                   |

Both test types analyze beacon output quality. The difference is:

  • Benchmark tests: Compare raw hash bytes against other trusted beacons (maximum sensitivity)
  • Die value tests: Analyze die values (1-6) derived from our hashes (relevant for dice-rolling use cases)

Both should pass for a well-functioning beacon.


Further Reading