Benchmark Tests
This page explains the statistical tests used to compare rng.dev against established randomness beacons: drand and NIST.
Key insight: If rng.dev's output is statistically indistinguishable from these gold-standard beacons, you can trust it for the same use cases.
What We're Comparing
Each beacon produces a random hash every round. We compare the raw hash bytes (0-255) directly for maximum sensitivity — with 256 possible byte values, we can detect subtle biases that would be invisible once reduced to 6 die values.
| Beacon | Output | Cadence | Bytes per Hash |
|---|---|---|---|
| rng.dev | 256-bit SHA3-256 hash | 1 second | 32 bytes |
| drand | 256-bit BLS signature hash | 3 seconds (quicknet) | 32 bytes |
| NIST | 512-bit hash | 60 seconds | 64 bytes |
For comparison, we:
- Collect N rounds from each beacon (e.g., 1,000 rounds)
- Extract all bytes from each round's hash (32 bytes × 1,000 = 32,000 bytes)
- Run Kolmogorov-Smirnov tests on the byte distributions
- Compare the resulting effect sizes
If all beacons show similar effect sizes near zero, their outputs are statistically equivalent.
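The byte-extraction step looks roughly like this (a minimal sketch — the hashes here are synthetic stand-ins for real beacon rounds):

```python
import hashlib

# Synthetic stand-ins for real beacon rounds (each a 32-byte SHA3-256 hash).
rounds = [hashlib.sha3_256(str(i).encode()).hexdigest() for i in range(1000)]

# Extract every byte (0-255) from every round's hash into one flat sample.
all_bytes = [b for h in rounds for b in bytes.fromhex(h)]

print(len(all_bytes))  # 32 bytes x 1,000 rounds = 32,000 bytes
```

The flat byte sample is what every test below operates on.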
Matched Sample Sizes
For fair comparison, we use matched sample sizes. If you request 100,000 rounds but NIST only has 5,000 samples available, we compare 5,000 rounds from each source. The API response explains when sample size is limited by a particular source.
The Benchmark Tests
The benchmark table displays 4 statistical tests on raw hash bytes:
| Test | What It Measures | Effect Size |
|---|---|---|
| K-S Uniformity | Deviation from uniform byte distribution | D-statistic (0-1) |
| K-S Pairwise | Distribution difference between sources | Max D-statistic vs other sources |
| Byte Entropy | Information content of byte distribution | 1 - efficiency (0 = ideal) |
| Serial Corr. | Autocorrelation between consecutive bytes | Max \|r\| across lags |
1. K-S Pairwise Comparison (Row: "K-S Pairwise")
What it tests: Are the byte distributions from two beacons indistinguishable?
How it works:
- Collect hash bytes (0-255) from both beacons
- Run a two-sample Kolmogorov-Smirnov test
- Calculate the D-statistic (maximum difference between cumulative distributions)
- Use D-statistic as effect size
Pairwise comparisons performed:
- rng.dev vs drand
- rng.dev vs NIST
- drand vs NIST
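In code, the two-sample D-statistic reduces to the maximum gap between the two empirical CDFs. A minimal pure-Python sketch (`scipy.stats.ks_2samp` performs the same computation and adds a p-value):

```python
import random
from collections import Counter

def ks_d_statistic(sample1, sample2):
    """Two-sample K-S D: the maximum gap between the two empirical CDFs."""
    c1, c2 = Counter(sample1), Counter(sample2)
    n1, n2 = len(sample1), len(sample2)
    cum1 = cum2 = 0
    d = 0.0
    for v in range(256):  # walk every possible byte value in order
        cum1 += c1.get(v, 0)
        cum2 += c2.get(v, 0)
        d = max(d, abs(cum1 / n1 - cum2 / n2))
    return d

# Two independent uniform byte samples should give a D near zero.
random.seed(0)
a = [random.randrange(256) for _ in range(32000)]
b = [random.randrange(256) for _ in range(32000)]
print(ks_d_statistic(a, b))
```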
Why it matters:
If two random sources produce statistically equivalent byte distributions, their outputs are interchangeable for practical purposes. The K-S test is sensitive to differences in shape, location, and scale of distributions.
Interpretation:
The D-statistic is the effect size (0 to 1, lower = more similar):
| D-statistic | Effect | Status |
|---|---|---|
| < 0.10 | Negligible/Weak | ✓ PASS |
| 0.10 - 0.20 | Moderate | ⚠ WATCH |
| > 0.20 | Strong | ✗ INVESTIGATE |
Example result:
Pairwise K-S Comparison (1,000 rounds):
Comparison | D-statistic | Effect Size | Status
----------------|-------------|-------------|--------
rng.dev ↔ drand | 0.008 | negligible | PASS
rng.dev ↔ NIST | 0.011 | negligible | PASS
drand ↔ NIST | 0.007 | negligible | PASS
Table display: Each cell shows that source's maximum effect size vs the other two sources. For example:
- rng.dev cell: max(rng↔drand, rng↔NIST)
- drand cell: max(drand↔rng, drand↔NIST)
- NIST cell: max(NIST↔rng, NIST↔drand)
2. K-S Uniformity Test (Row: "K-S Uniformity")
What it tests: Does each beacon's byte distribution follow a uniform distribution (0-255)?
How it works:
- Collect all hash bytes from N rounds
- Run a one-sample K-S test against the theoretical uniform(0, 255) distribution
- Calculate the D-statistic as effect size
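A sketch of the one-sample version, comparing a byte sample's empirical CDF against the theoretical discrete uniform(0, 255) CDF:

```python
import random
from collections import Counter

def ks_uniformity_d(byte_sample):
    """One-sample K-S D against the discrete uniform(0, 255) distribution."""
    counts = Counter(byte_sample)
    n = len(byte_sample)
    cum = 0
    d = 0.0
    for v in range(256):
        cum += counts.get(v, 0)
        # Theoretical CDF of uniform bytes: P(X <= v) = (v + 1) / 256.
        d = max(d, abs(cum / n - (v + 1) / 256))
    return d

random.seed(1)
sample = [random.randrange(256) for _ in range(32000)]
print(ks_uniformity_d(sample))  # near zero for a healthy uniform source
```

A sample confined to bytes 0-127 would score D ≈ 0.5, landing squarely in INVESTIGATE territory.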
Why it matters:
A cryptographic hash should produce bytes uniformly distributed across 0-255. Any deviation suggests a flaw in the hash function or entropy source.
Perfect uniform byte distribution:
┌────────────────────────────────────────────────┐
│ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ │
│ Each byte value (0-255) appears equally often │
└────────────────────────────────────────────────┘
Biased distribution (would fail uniformity test):
┌────────────────────────────────────────────────┐
│ ▓▓▓▓▓▓████████████████████▓▓▓▓▓▓ │
│ Some byte values appear more often than others │
└────────────────────────────────────────────────┘
Interpretation:
| D-statistic | Effect | Status |
|---|---|---|
| < 0.10 | Negligible/Weak | ✓ PASS |
| 0.10 - 0.20 | Moderate | ⚠ WATCH |
| > 0.20 | Strong | ✗ INVESTIGATE |
Example result:
Uniformity Test (1,000 rounds):
Beacon | D-statistic | Effect Size | Status
----------|-------------|-------------|--------
rng.dev | 0.006 | negligible | PASS
drand | 0.005 | negligible | PASS
NIST | 0.008 | negligible | PASS
3. Byte Entropy Test (Row: "Byte Entropy")
What it tests: Does the byte distribution contain maximum information content?
How it works:
- Collect all hash bytes from N rounds
- Calculate Shannon entropy: H = -Σ(p × log₂(p))
- Maximum entropy for bytes is log₂(256) = 8 bits
- Efficiency = actual entropy / 8 bits
- Effect size = 1 - efficiency (0 = perfect, 1 = completely biased)
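The entropy calculation is small enough to sketch directly (pure Python, no dependencies):

```python
import math
import random
from collections import Counter

def entropy_effect_size(byte_sample):
    """Return (Shannon entropy in bits, effect size = 1 - entropy/8)."""
    n = len(byte_sample)
    h = -sum((c / n) * math.log2(c / n) for c in Counter(byte_sample).values())
    return h, 1.0 - h / 8.0  # max entropy for bytes is log2(256) = 8 bits

random.seed(2)
uniform = [random.randrange(256) for _ in range(32000)]
print(entropy_effect_size(uniform))   # entropy near 8.0, effect size near 0

# A source stuck on 16 byte values tops out at log2(16) = 4 bits.
biased = [random.randrange(16) for _ in range(32000)]
print(entropy_effect_size(biased))    # entropy near 4.0, effect size near 0.5
```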
Why it matters:
A cryptographic hash should produce bytes with maximum entropy (8 bits). Lower entropy indicates some byte values appear more frequently than others — a potential bias in the random source.
Perfect entropy (8 bits):
┌────────────────────────────────────────────────┐
│ All 256 byte values equally likely │
│ Entropy = 8.00 bits, Efficiency = 100% │
└────────────────────────────────────────────────┘
Reduced entropy (4 bits):
┌────────────────────────────────────────────────┐
│ Only 16 byte values appear (e.g., 0-15) │
│ Entropy = 4.00 bits, Efficiency = 50% │
│ Effect size = 0.50 → INVESTIGATE │
└────────────────────────────────────────────────┘
Interpretation:
| Effect Size | Efficiency | Status |
|---|---|---|
| < 0.10 | > 90% | ✓ PASS |
| 0.10 - 0.20 | 80-90% | ⚠ WATCH |
| > 0.20 | < 80% | ✗ INVESTIGATE |
4. Serial Correlation Test (Row: "Serial Corr.")
What it tests: Are consecutive bytes correlated (predictable from each other)?
How it works:
- Calculate autocorrelation at lags 1-20
- Autocorrelation r(k) measures how much byte[i] correlates with byte[i+k]
- Effect size = maximum |r| across all lags
- Uses Ljung-Box test for combined significance
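The autocorrelation scan can be sketched as follows (the Ljung-Box step, which aggregates the per-lag values into a single significance test, is omitted here):

```python
import random

def max_autocorrelation(byte_sample, max_lag=20):
    """Maximum |r(k)| over lags 1..max_lag, where r(k) is lag-k autocorrelation."""
    n = len(byte_sample)
    mean = sum(byte_sample) / n
    var = sum((x - mean) ** 2 for x in byte_sample) / n
    best = 0.0
    for k in range(1, max_lag + 1):
        cov = sum((byte_sample[i] - mean) * (byte_sample[i + k] - mean)
                  for i in range(n - k)) / n
        best = max(best, abs(cov / var))
    return best

random.seed(3)
sample = [random.randrange(256) for _ in range(10000)]
print(max_autocorrelation(sample))    # near zero: PASS

# A period-4 sequence spikes at lag 4, as in the FAIL diagram above.
periodic = [(i % 4) * 80 for i in range(10000)]
print(max_autocorrelation(periodic))  # near 1: FAIL
```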
Why it matters:
In a random sequence, knowing one byte should provide no information about subsequent bytes. High autocorrelation suggests patterns or predictability that could compromise randomness.
Random data (no correlation):
┌────────────────────────────────────────────────┐
│ Lag 1: r = 0.002 Lag 2: r = -0.001 │
│ Lag 3: r = 0.003 Lag 4: r = 0.001 │
│ Max |r| = 0.003 → Effect size = 0.003 → PASS │
└────────────────────────────────────────────────┘
Periodic pattern detected:
┌────────────────────────────────────────────────┐
│ Lag 1: r = -0.15 Lag 2: r = 0.02 │
│ Lag 3: r = 0.01 Lag 4: r = 0.89 ← period │
│ Max |r| = 0.89 → Effect size = 0.89 → FAIL │
└────────────────────────────────────────────────┘
Interpretation:
Based on Cohen's (1988) conventions for correlation:
| Max \|r\| | Effect | Status |
|---|---|---|
| < 0.10 | Negligible | ✓ PASS |
| 0.10 - 0.20 | Weak | ⚠ WATCH |
| > 0.20 | Moderate+ | ✗ INVESTIGATE |
How to Read the Results
The benchmark table shows effect sizes for each test. Here's how to interpret them:
| Effect Size | Color | Meaning |
|---|---|---|
| < 0.05 | Green | Negligible — excellent randomness |
| 0.05 - 0.10 | Green | Weak — good randomness |
| 0.10 - 0.20 | Yellow | Moderate — worth monitoring |
| > 0.20 | Red | Strong — investigate for issues |
Key points:
- Similar effect sizes across beacons = rng.dev is statistically equivalent
- Effect sizes near zero = distributions are indistinguishable from ideal
- All beacons should behave similarly — if one shows high effect size and others don't, investigate
Why effect size instead of p-values?
At large sample sizes (10,000+ rounds), p-values become misleading. A 0.1% deviation from perfect uniformity can produce p < 0.001, but such tiny deviations have no practical impact. Effect size measures the magnitude of deviation, which stays meaningful regardless of sample size.
Why Compare Against drand and NIST?
| Beacon | Why It's a Gold Standard |
|---|---|
| drand | Threshold BLS signatures from 20+ independent operators; mathematically provable randomness |
| NIST | US government standard; hardware-sourced entropy; decades of cryptographic research |
If rng.dev's statistical properties match these established beacons, you can trust it for equivalent use cases. The comparison provides empirical evidence that our blockchain-derived randomness is as good as purpose-built randomness beacons.
Why Raw Bytes Instead of Die Values?
Previous versions compared die values (1-6) derived from each beacon's hash. We now compare raw hash bytes (0-255) for several reasons:
| Approach | Categories | Sensitivity | Why |
|---|---|---|---|
| Die values (1-6) | 6 | Lower | Collapsing 256 byte values to 6 categories loses information |
| Hash bytes (0-255) | 256 | Higher | Full resolution detects subtle biases |
Example:
If a beacon produces bytes 0-127 slightly more often than 128-255, die value derivation would hide this (all map to valid die values). Raw byte comparison catches it immediately.
Most users consume the hash directly, not the die value, so comparing hashes directly is more relevant to real-world usage.
Sample Size Considerations
The benchmark table lets you select different sample sizes:
| Sample Size | Statistical Power | Best For |
|---|---|---|
| 100 rounds | Low — can miss subtle bias | Quick sanity check |
| 1,000 rounds | Medium — catches most issues | Standard monitoring |
| 10,000 rounds | High — detects subtle patterns | Deep analysis |
| 100,000 rounds | Very high — rigorous validation | Publication-quality claims |
Larger samples provide more statistical power but take longer to collect. The default (1,000 rounds) balances responsiveness with reliability.
API Response
The /api/v1/comparisons endpoint returns:
{
"sample_counts": {
"beacon": 1000,
"drand": 1000,
"nist": 1000
},
"byte_counts": {
"beacon": 32000,
"drand": 32000,
"nist": 64000
},
"pairwise_comparisons": [
{
"source1": "beacon",
"source2": "drand",
"statistic": 0.008,
"p_value": 0.92,
"effect_size": 0.008,
"n1": 32000,
"n2": 32000,
"interpretation": "Byte distributions are virtually identical",
"status": "PASS"
}
],
"uniformity_tests": [
{
"source": "beacon",
"statistic": 0.006,
"p_value": 0.95,
"effect_size": 0.006,
"n": 32000,
"interpretation": "Byte distribution virtually uniform",
"status": "PASS"
}
],
"entropy_tests": [
{
"source": "beacon",
"entropy": 7.999,
"max_entropy": 8.0,
"efficiency": 0.9999,
"effect_size": 0.0001,
"n": 32000,
"status": "PASS"
}
],
"serial_correlation_tests": [
{
"source": "beacon",
"max_correlation": 0.012,
"effect_size": 0.012,
"p_value": 0.89,
"status": "PASS"
}
],
"overall_status": "PASS",
"plain_english": "Compared 1,000 samples from each source. All distributions are statistically indistinguishable — beacon randomness matches known reference sources."
}
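A client might summarize such a response like this (a sketch only — the dict below is a truncated stand-in mirroring the field names in the example above):

```python
# Truncated stand-in for the /api/v1/comparisons response shown above.
response = {
    "overall_status": "PASS",
    "pairwise_comparisons": [
        {"source1": "beacon", "source2": "drand", "effect_size": 0.008, "status": "PASS"},
        {"source1": "beacon", "source2": "nist", "effect_size": 0.011, "status": "PASS"},
        {"source1": "drand", "source2": "nist", "effect_size": 0.007, "status": "PASS"},
    ],
}

def worst_pairwise(resp):
    """Largest pairwise effect size across all comparisons."""
    return max(c["effect_size"] for c in resp["pairwise_comparisons"])

print(response["overall_status"], worst_pairwise(response))
```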
Relationship to Die Value Tests
| Test Type | Purpose | Documented In |
|---|---|---|
| Benchmark tests (this page) | Compare rng.dev against drand & NIST using raw hash bytes | Cross-beacon validation |
| Die value tests | Validate rng.dev's die value output quality | Statistical Tests |
Both test types analyze beacon output quality. The difference is:
- Benchmark tests: Compare raw hash bytes against other trusted beacons (maximum sensitivity)
- Die value tests: Analyze die values (1-6) derived from our hashes (relevant for dice-rolling use cases)
Both should pass for a well-functioning beacon.
Further Reading
- Statistical Tests — Tests for die value output
- How It Works — Beacon generation process
- NIST SP 800-22 — Statistical test suite for RNGs
- drand Documentation — Threshold randomness beacon