Statistical Tests

This page explains the statistical tests used to validate die value randomness quality. These tests run continuously on rolling windows of beacon output.

Note: For tests comparing rng.dev against drand and NIST beacons (hash byte comparison), see Benchmark Tests.


Understanding the Results

P-Values

Every test produces a p-value between 0 and 1. This represents the probability of seeing results this extreme (or more extreme) if the data were truly random.

P-Value | Status | Meaning
--------|--------|--------
> 0.05 | PASS | No evidence against randomness
0.01 - 0.05 | WATCH | Borderline — monitor for patterns
< 0.01 | INVESTIGATE | Statistically significant deviation

Expected Failures

Random data will sometimes fail random tests. This is not a bug — it's mathematics.

At significance level α = 0.05:

  • 5% of tests will fail even for perfect randomness
  • Running 6 tests means ~26% chance at least one fails per window
  • This is why we apply Bonferroni correction (divide α by number of tests)

The dashboard shows occasional failures as expected behavior, not system problems. Only persistent, repeated failures indicate potential bias.


Test Descriptions

1. Chi-Squared Distribution Test

What it tests: Are all die faces appearing with equal frequency?

How it works:

  1. Count occurrences of each face (1-6)
  2. Compare observed counts to expected counts (n/6 each)
  3. Calculate chi-squared statistic: χ² = Σ (observed - expected)² / expected

What it detects:

  • Biased die (one face appears more often)
  • Manufacturing defects in physical dice
  • Software bugs favoring certain values

Interpretation:

  • High p-value (>0.05): Distribution looks uniform
  • Low p-value (<0.01): Some faces appear too often or too rarely

Example:

1000 rolls, expected 166.7 per face

Face | Observed | Expected | Contribution
-----|----------|----------|-------------
1 | 158 | 166.7 | 0.45
2 | 172 | 166.7 | 0.17
3 | 165 | 166.7 | 0.02
4 | 170 | 166.7 | 0.07
5 | 168 | 166.7 | 0.01
6 | 167 | 166.7 | 0.00
─────────
χ² = 0.72

p-value = 0.98 → PASS (no bias detected)
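The worked example above can be reproduced in a few lines of Python. This is a minimal sketch of the statistic only; in practice a library routine (e.g. `scipy.stats.chisquare`) would also return the p-value.

```python
counts = [158, 172, 165, 170, 168, 167]   # observed counts for faces 1-6
expected = sum(counts) / 6                # 166.7 expected per face

# Chi-squared statistic: sum of (observed - expected)^2 / expected
chi2 = sum((obs - expected) ** 2 / expected for obs in counts)
print(round(chi2, 2))  # 0.72

# The p-value is the survival function of the chi-squared distribution
# with 5 degrees of freedom, e.g. scipy.stats.chi2.sf(chi2, df=5) ≈ 0.98
```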

2. Runs Test (Odd/Even Parity)

What it tests: Do odd and even values alternate appropriately?

How it works:

  1. Convert each die value to parity: odd (1,3,5) → 1, even (2,4,6) → 0
  2. Count "runs" — consecutive sequences of same parity
  3. Compare run count to expected value for random sequence
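The steps above can be sketched as a Wald-Wolfowitz runs test using the normal approximation (the function name and structure here are illustrative, not the beacon's actual implementation):

```python
import math

def runs_test(values):
    """Runs test on odd/even parity; returns a two-sided p-value."""
    bits = [v % 2 for v in values]                  # odd -> 1, even -> 0
    n1, n0 = bits.count(1), bits.count(0)
    n = n1 + n0
    # A run boundary occurs wherever adjacent parities differ
    runs = 1 + sum(a != b for a, b in zip(bits, bits[1:]))
    mean = 2 * n1 * n0 / n + 1                      # expected run count
    var = 2 * n1 * n0 * (2 * n1 * n0 - n) / (n * n * (n - 1))
    z = (runs - mean) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2))         # two-sided p-value
```

A perfectly alternating sequence (1, 2, 1, 2, ...) produces far more runs than expected and yields a p-value near zero, as does a fully clustered sequence with only two runs.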

What it detects:

  • Too much alternation (odd-even-odd-even pattern)
  • Too much clustering (odd-odd-odd-odd pattern)
  • Predictable sequencing

Why odd/even instead of above/below median:

  • Discrete die values split cleanly: exactly 3 odd, 3 even
  • Avoids ambiguity at median (3.5)
  • Better statistical properties for 6-sided dice

Interpretation:

  • High p-value: Runs count is normal for random data
  • Low p-value: Sequence is too clustered or too alternating

3. Streak Distribution

What it tests: Do consecutive same-value sequences follow expected lengths?

How it works:

  1. Find all "streaks" — consecutive identical values
  2. Count streaks of each length (1, 2, 3, 4+)
  3. Compare to theoretical distribution

Expected distribution for fair die:

P(streak length k) = (1/6)^(k-1) × (5/6)

Length | Probability | Per 1000 streaks
-------|-------------|------------------
1 | 83.3% | 833
2 | 13.9% | 139
3 | 2.3% | 23
4 | 0.4% | 4
5+ | 0.1% | 1
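A sketch of extracting streak lengths and computing the geometric probabilities in the table above:

```python
def streak_lengths(values):
    """Lengths of maximal runs of identical consecutive values."""
    if not values:
        return []
    lengths = [1]
    for prev, cur in zip(values, values[1:]):
        if cur == prev:
            lengths[-1] += 1     # streak continues
        else:
            lengths.append(1)    # new streak starts
    return lengths

def p_streak(k):
    """P(streak length = k) for a fair six-sided die."""
    return (1 / 6) ** (k - 1) * (5 / 6)

print(streak_lengths([3, 3, 3, 1, 5, 5]))  # [3, 1, 2]
```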

What it detects:

  • Sticky behavior (too many long streaks)
  • Anti-sticky behavior (too few repeats)
  • Memory in the random source

Interpretation:

  • High p-value: Streak lengths are normal
  • Low p-value: Streaks are suspiciously long or short

4. Transition Matrix Test

What it tests: Does the next value depend on the current value?

How it works:

  1. Build 6×6 matrix counting transitions (e.g., "1 followed by 4")
  2. For each row, run chi-squared test against uniform distribution
  3. Report 6 separate p-values (one per starting value)
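A minimal sketch of steps 1–2 (the per-row p-value would again come from the chi-squared survival function with 5 degrees of freedom):

```python
def transition_counts(values):
    """6x6 matrix m where m[i][j] counts value i+1 followed by j+1."""
    m = [[0] * 6 for _ in range(6)]
    for cur, nxt in zip(values, values[1:]):
        m[cur - 1][nxt - 1] += 1
    return m

def row_chi2(row):
    """Chi-squared statistic of one row against a uniform expectation."""
    expected = sum(row) / 6
    return sum((c - expected) ** 2 / expected for c in row)
```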

Expected behavior:

P(next = j | current = i) = 1/6 for all i, j

Transition matrix should look roughly like:
→ | 1 | 2 | 3 | 4 | 5 | 6
--|---|---|---|---|---|--
1 | 16.7% | 16.7% | 16.7% | 16.7% | 16.7% | 16.7%
2 | 16.7% | 16.7% | 16.7% | 16.7% | 16.7% | 16.7%
...

What it detects:

  • Markov dependencies ("3 is often followed by 5")
  • Mechanical bias in physical dice
  • Pseudo-random generator weaknesses

Why 6 separate tests instead of one:

  • Raw matrix chi-squared violates independence assumptions
  • Row-by-row testing is statistically valid
  • Pinpoints which transitions are problematic

Interpretation:

  • All rows p > 0.05: No transition bias detected
  • One row p < 0.01: That starting value has biased follow-ups

5. Serial Pair Test

What it tests: Do all consecutive pairs appear with equal frequency?

How it works:

  1. Extract all consecutive pairs: (1,4), (4,2), (2,6), ...
  2. Count occurrences of each of 36 possible pairs
  3. Chi-squared test against expected frequency (n/36 each)

Expected behavior:

36 unique pairs, each with probability 1/36 ≈ 2.78%

In 10,000 rolls → 9,999 overlapping pairs → ~278 expected per pair
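Counting the 36 ordered pairs over overlapping positions is a one-liner with `collections.Counter` (a sketch; the chi-squared step then compares each count to n/36):

```python
from collections import Counter

def serial_pair_counts(values):
    """Counts of each ordered pair over overlapping consecutive positions."""
    return Counter(zip(values, values[1:]))

counts = serial_pair_counts([1, 4, 2, 6, 1, 4])
print(counts[(1, 4)])  # 2
```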

What it detects:

  • Subtle sequential bias invisible to single-value tests
  • "1 is more likely to follow 3" patterns
  • State-dependent behavior

Difference from transition matrix:

  • Transition matrix: conditional probabilities (P(next|current))
  • Serial pair test: joint probabilities (P(current AND next))
  • Both are valuable; they catch different anomalies

Interpretation:

  • High p-value: All pairs appear equally often
  • Low p-value: Some pairs are over/under-represented

6. Shannon Entropy

What it tests: How much information content is in the output?

How it works:

  1. Calculate frequency of each face: p_i = count_i / total
  2. Compute entropy: H = -Σ p_i × log₂(p_i)
  3. Compare to theoretical maximum

Theoretical values:

Maximum entropy for 6-sided die:
H_max = log₂(6) ≈ 2.585 bits

This occurs when all faces are equally likely (p = 1/6)
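The entropy computation can be sketched directly from the formula (zero counts are skipped, since p log p → 0 as p → 0):

```python
import math

def shannon_entropy(counts):
    """H = -sum(p_i * log2(p_i)) over observed face counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

print(round(shannon_entropy([100] * 6), 3))                      # 2.585 (uniform)
print(round(shannon_entropy([500, 100, 100, 100, 100, 100]), 3)) # 2.161 (biased)
```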

What it detects:

  • Low entropy = predictable output
  • Concentrated distribution = fewer effective outcomes
  • Information loss from bias

Interpretation:

Observed Entropy | Meaning
-----------------|--------
~2.585 bits | Perfect — all outcomes equally likely
2.4 - 2.58 bits | Good — minor variation
< 2.4 bits | Concerning — some outcomes dominate
< 2.0 bits | Serious — significant bias present

Note: Entropy is always shown as a metric, not a pass/fail test, because it provides intuitive understanding of randomness quality.


7. Autocorrelation

What it tests: Is there correlation between values at different time lags?

How it works:

  1. For each lag k (1, 2, 3, ..., 20), compute the correlation between the sequence and a copy of itself shifted by k
  2. Check whether the correlations fall within the expected confidence bounds

Expected behavior:

For truly random data:
- Autocorrelation at all lags ≈ 0
- 95% confidence bounds: ±1.96/√n

For n = 1000:
- Bounds ≈ ±0.062
- Values outside bounds suggest correlation
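A sketch of the lag-k sample autocorrelation and the 95% confidence bound described above (function names are illustrative):

```python
import math

def autocorrelation(values, lag):
    """Sample autocorrelation of the sequence with itself shifted by lag."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values)
    cov = sum((values[i] - mean) * (values[i + lag] - mean)
              for i in range(n - lag))
    return cov / var

def bound(n):
    """95% confidence bound for truly random data: ±1.96/sqrt(n)."""
    return 1.96 / math.sqrt(n)

print(round(bound(1000), 3))  # 0.062
```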

What it detects:

  • Periodic patterns (every 10th value repeats)
  • Trending behavior
  • Poor PRNG with short cycles

Important caveat: Die values (1-6) are categorical, not continuous. Autocorrelation assumes numeric distance matters, but "1 → 6" isn't meaningfully different from "2 → 3" in randomness terms.

Interpretation:

  • Use for visualization and trend detection
  • Don't rely on it as primary randomness test
  • Transition matrix is more appropriate for sequential independence

Rolling Windows

Tests run on multiple window sizes to catch both short-term and long-term anomalies:

Window | Purpose | Update Frequency
-------|---------|------------------
100 | Short-term fluctuations | Every round
1,000 | Medium-term patterns | Every round
10,000 | Stable statistics | Every 10 rounds
100,000 | Long-term validation | Every 100 rounds

Why multiple windows?

  • Small windows: Responsive to recent changes, but noisy
  • Large windows: Stable statistics, but slow to detect new problems
  • Combined view: Best of both worlds
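As a sketch, the multi-window bookkeeping can be done with fixed-size deques (window sizes from the table above; the staggered update cadence is omitted here and every window simply receives each roll):

```python
from collections import deque

# Fixed-size rolling windows; appending past maxlen evicts the oldest roll.
windows = {size: deque(maxlen=size) for size in (100, 1_000, 10_000, 100_000)}

def record_roll(value):
    """Push one die value into every rolling window."""
    for window in windows.values():
        window.append(value)
```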

Bonferroni Correction

When running multiple tests, false positives accumulate.

Problem:

  • 6 tests at α = 0.05
  • Probability of at least one false positive: 1 - (0.95)⁶ ≈ 26%

Solution: Bonferroni correction

  • Adjusted α = 0.05 / 6 ≈ 0.0083
  • Each test must achieve p < 0.0083 to be "significant"
  • Family-wise error rate stays at 5%
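The arithmetic above, plus a tiny helper applying the corrected threshold (the helper name is illustrative):

```python
alpha, m = 0.05, 6
fwer = 1 - (1 - alpha) ** m   # ≈ 0.265: chance of >= 1 false positive
alpha_adj = alpha / m         # ≈ 0.0083: Bonferroni-adjusted threshold

def significant(p_values, alpha=0.05):
    """Flag each test, using the Bonferroni-corrected threshold."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]
```

Note that p = 0.04 would count as "significant" on its own, but not after correction across six tests.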

Dashboard shows:

  • Individual test p-values
  • Bonferroni-corrected overall status
  • Whether any test is "significant" after correction

What These Tests Cannot Tell You

Statistical tests have limitations:

  1. Cannot prove randomness — only detect specific types of non-randomness
  2. Cannot detect all manipulation — adversary might pass all tests
  3. Cannot predict future values — that's the point
  4. Will occasionally fail — 5% false positive rate is expected

The dashboard is evidence of quality, not proof of perfection.


Further Reading